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Abstract 

This  thesis  introduces  Hidden  Process  Models  (HPMs).  HPMs  are  a  prob¬ 
abilistic  time  series  model  for  data  assumed  to  be  generated  by  a  set  of  pro¬ 
cesses,  where  each  process  is  characterized  by  a  unique  spatial-temporal  sig¬ 
nature  and  a  probability  distribution  over  its  timing  relative  to  a  set  of  known 
timing  landmarks.  Research  on  HPMs  has  been  inspired  and  motivated  by  the 
functional  Magnetic  Resonance  Imaging  (fMRI)  domain,  and  this  document 
presents,  develops,  and  evaluates  this  framework  in  the  context  of  fMRI.  We 
provide  the  HPM  formalism,  inference  and  learning  algorithms,  extensions  to 
the  basic  formalism,  a  discussion  of  the  correspondence  between  HPMs  and 
Dynamic  Bayesian  Networks,  experimental  results  evaluating  HPMs  on  real 
and  synthetic  fMRI  data,  and  examples  of  how  to  visualize  the  learned  mod¬ 
els.  We  conclude  that  the  HPM  extensions  incorporating  domain  knowledge 
about  the  process  signatures  are  important  for  analyzing  real  fMRI  data,  and 
suggest  future  improvements  to  the  model. 
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2.1  Sample  configurations  for  a  trial  in  the  sentence-picture  verification  ex¬ 
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Chapter  1 
Introduction 


This  thesis  is  inspired  by  high-dimensional  time  series  data  that  is  assumed  to  be  generated 
by  a  set  of  underlying  processes  whose  timings  are  partially  observed.  We  define  a  process 
to  have  a  spatial-temporal  signature  and  a  probability  distribution  over  the  times  when  it 
can  occur  relative  to  a  set  of  known  timing  landmarks  in  the  data.  We  are  interested 
in  simultaneously  estimating  the  signatures  and  timings  of  the  processes.  To  this  end, 
we  have  developed  Hidden  Process  Models  (HPMs),  which  we  introduce,  develop,  and 
evaluate  below. 

In  this  chapter,  we  motivate  our  efforts  with  the  functional  Magnetic  Resonance  Imag¬ 
ing  (fMRI)  domain,  discuss  related  work,  and  state  our  thesis. 


1.1  Motivation:  functional  Magnetic  Resonance  Imaging 

Functional  Magnetic  Resonance  Imaging  is  a  safe,  non-invasive  technology  that  can  be 
used  to  image  the  brain.  Since  the  spatial  resolution  of  fMRI  is  on  the  order  of  millimeters, 
and  the  temporal  resolution  is  on  the  order  of  seconds,  sequences  of  brain  images  taken 
while  a  person  in  the  scanner  performs  mental  tasks  can  provide  rich  data  for  cognitive 
scientists. 

fMRI  does  not  measure  neuronal  activity  directly;  instead,  it  measures  a  corellate  of 
neuronal  activity  based  on  blood  flow.  Since  oxygenated  and  deoxygenated  hemoglobin 
have  different  magnetic  properties  (oxygenated  hemoglobin  is  diamagnetic  and  deoxy¬ 
genated  hemoglobin  is  paramagnetic),  the  fMRI  scanner  can  detect  changes  in  the  relative 
amounts  of  these  two  molecules.  When  metabolic  function  in  a  brain  location  increases 
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(presumably  in  response  to  increased  neural  activity),  more  oxygenated  hemoglobin  is  de¬ 
livered  to  that  location.  The  additional  oxygenated  hemoglobin  along  with  its  subsequent 
conversion  to  deoxygenated  hemoglobin  constitute  an  effect  measurable  by  fMRI,  called 
the  Blood  Oxygenation  Level  Dependent  (BOLD)  response  or  the  hemodynamic  response. 
The  BOLD  response  is  sluggish  compared  to  the  time  scale  of  the  corresponding  neural 
activity;  less  than  a  second  of  neural  activation  can  cause  a  BOLD  response  lasting  several 
seconds.  While  this  may  seem  a  crude  approximation  of  the  neural  response,  many  studies 
have  shown  that  there  is  sufficient  signal  in  fMRI  data  for  distinguishing  among  various 
kinds  of  cognitive  states  (see  the  discussion  of  related  work  below  for  examples). 


1.2  A  Running  Example 

To  make  the  introduction  of  HPMs  more  clear,  we  will  use  concrete  examples  based  on 
the  following  experimental  paradigm  for  sentence-picture  verification.  We  have  both  real 
fMRI  data  (Keller  et  al.  [2001])  and  synthesized  data  for  this  paradigm  which  we  will 
describe  later  in  more  detail. 

In  our  datasets,  participants  were  presented  with  a  sequence  of  40  trials.  In  half  of  the 
trials,  participants  were  shown  a  picture  (involving  vertical  arrangements  of  the  symbols  *, 
+,  and  $)  for  4  seconds,  followed  by  a  blank  screen  for  4  seconds,  followed  by  a  sentence 
(e.g.  “The  star  is  above  the  plus.”)  for  4  seconds.  Participants  could  press  buttons  indi¬ 
cating  whether  the  sentence  correctly  described  the  picture  at  any  time  during  the  second 
stimulus  presentation.  The  participants  then  rested  for  15  seconds  before  the  next  trial  be¬ 
gan.  In  the  other  half  of  the  trials  the  sentence  was  presented  first  and  the  picture  second, 
using  the  same  timing.  Figure  1.1  diagrams  the  temporal  structure  of  the  trials. 
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Figure  1.1:  Experiment  structure  for  the  sentence-picture  verification  task.  In  20  trials,  the 
sentence  was  presented  first,  and  in  20  trials  the  picture  was  presented  first.  The  timelines 
show  that  each  stimulus  was  presented  for  4  seconds,  with  4  seconds  of  fixation  (looking 
at  a  cross)  between  the  stimuli.  Trials  were  separated  by  15  second  rest  periods.  Buttons 
could  be  pressed  any  time  during  the  second  stimulus  presentation  indicating  whether  the 
sentence  correctly  described  the  picture  (in  the  upper  example  it  does,  in  the  lower  example 
it  does  not).  Example  stimuli  are  shown  above  the  timeline. 
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1.3  Overview  of  Related  Work 


Two  major  problems  in  analyzing  fMRI  data  can  be  broadly  described  as  detection  and  es¬ 
timation.  By  detection,  we  mean  the  problem  of  determining  which  voxels  in  the  brain  are 
involved  in  the  response  to  a  particular  stimulus  or  experimental  condition.  By  estimation, 
we  mean  the  problem  of  modeling  the  hemodynamic  response  function  (HRF)  for  one  or 
more  stimuli  or  experimental  conditions.  The  two  problems  are  related,  since  detection  is 
often  done  by  comparing  the  observed  fMRI  data  to  the  convolution  of  a  hemodynamic 
response  function  with  the  time  course  of  the  experiment.  Hidden  Process  Models  address 
the  problem  of  estimation. 

Perhaps  the  most  common  approach  to  the  problem  of  estimating  the  HRF  is  the  Gen¬ 
eral  Linear  Model  (GLM),  in  which  the  data  is  modeled  as  a  linear  combination  of  the 
parameters  of  the  HRF  (to  be  estimated)  and  a  design  matrix  describing  the  timing  of  the 
experiment.  This  method  is  detailed  in  Dale  and  Buckner  [1997]  and  Dale  [1999].  The 
work  in  Dale  and  Buckner  [1997]  helped  push  fMRI  experiment  design  from  blocked  de¬ 
signs  (in  which  stimuli  are  presented  in  rapid  succession  and  the  data  is  averaged  over 
the  presentations  to  improve  the  signal  to  noise  ratio)  to  event-related  designs  (in  which 
stimuli  are  presented  and  analyzed  based  on  trials  of  different  types)  by  showing  that  the 
linearity  assumption  of  the  GLM  roughly  holds  for  visual  stimuli  spaced  as  little  as  two 
seconds  apart.  This  linearity  allows  the  selective  averaging  of  different  trial  types  to  es¬ 
timate  the  HRFs.  More  mathematical  detail  on  this  model  is  given  in  Dale  and  Buckner 
[1997].  Previously,  Boynton  et  al.  [1996]  also  investigated  the  validity  of  the  linearity 
assumption  commonly  made  in  fMRI  analysis  using  blocked  design  experiments  using 
blocks  of  different  length  presentations  of  visual  stimuli.  That  work  also  found  that  the 
response  to  a  longer  stimulus  can  be  reasonably  approximated  by  summing  and  shifting 
the  responses  to  shorter  stimuli.  Both  Boynton  et  al.  [1996]  and  Dale  and  Buckner  [1997] 
find  some  deviations  from  linearity  in  their  analyses,  but  both  conclude  that  it  is  a  rea¬ 
sonable  working  assumption  for  fMRI  analysis.  That  said,  researchers  have  also  studied 
the  nature  of  these  deviations  from  linearity  in  fMRI,  and  developed  alternative  models 
of  the  HRF  that  account  for  nonlinearities.  Friston  et  al.  [2000]  presents  a  nonlinear  es¬ 
timation  technique  for  HRFs.  Birn  et  al.  [2001]  discusses  the  spatial  heterogeneity  of  the 
nonlinearities. 

More  recently,  there  have  been  several  more  sophisticated  approaches  to  estimating 
hemodynamic  response  functions.  Dale  [1999]  discusses  an  extension  to  the  GLM  in 
which  the  HRF  parameters  are  modeled  as  weights  on  a  set  of  basis  functions,  and  Hossein- 
Zadeh  et  al.  [2003]  builds  on  that  approach  by  showing  how  to  choose  the  basis  functions 
using  Principal  Components  Analysis  (PCA)  (Jolliffe  [2002]).  The  PCA  is  done  over  in- 
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stantiations  of  a  canonical  HRF  form,  the  gamma  function,  that  vary  the  peak-time  and 
width  parameters  within  reasonable  ranges.  They  apply  their  method  to  the  problem  of 
detecting  significantly  active  voxels  and  find  that  it  is  more  sensitive  than  previous  ap¬ 
proaches  using  different  subspaces  to  model  the  HRF.  Another  extension  to  the  GLM  is 
described  in  Ciuciu  et  al.  [2003].  That  work  addresses  the  problem  of  estimating  HRFs 
for  events  that  are  not  synchronized  with  the  imaging  rate  (asynchronous  experiments), 
which  can  occur  for  instance  for  subject-initiated  events  with  variable  response  times.  The 
approach  puts  the  whole  regression  on  a  finer  temporal  grid,  on  which  both  the  image  ac¬ 
quisition  times  and  the  events  of  interest  are  defined.  The  Bayesian  estimation  procedure 
for  this  model  also  imposes  some  constraints  on  the  HRF  through  priors. 

An  effort  to  combine  the  detection  and  estimation  problems  in  a  single  framework  is 
described  in  Makni  et  al.  [2008].  That  paper  uses  an  HRF  model  similar  to  the  one  in  Ciu¬ 
ciu  et  al.  [2003],  and  similar  Bayesian  estimation  techniques.  Makni  et  al.  [2008]  divides 
the  brain  into  parcels  that  share  an  HRF  shape,  and  uses  an  autoregressive  noise  model. 
They  provide  an  algorithm  for  estimation  of  the  HRF  parameters  and  a  method  for  mak¬ 
ing  statistical  comparisons  across  conditions  in  the  framework.  This  represents  a  potential 
advance  over  treating  detection  and  estimation  sequentially,  but  it  is  still  constrained  to 
experimental  conditions  whose  timings  can  reasonably  be  assumed  known. 

Another  relevant  field  is  that  of  mental  chronometry,  which  is  concerned  with  de¬ 
composing  a  cognitive  task  into  its  component  processing  stages.  Traditionally,  mental 
chronometry  has  relied  on  behavioral  data  such  as  measured  reaction  times,  but  studies 
have  shown  that  functional  MRI  can  also  be  used  to  address  this  problem  (Menon  et  al. 
[1998],  Formisano  and  Goebel  [2003]).  Menon  et  al.  [1998]  addresses  the  question  of 
whether  fMRI  can  be  used  to  detect  latency  between  brain  areas.  In  a  visual  task,  the  au¬ 
thors  find  that  the  delay  between  presentations  of  stimuli  in  the  left  and  right  hemifields  is 
highly  correlated  with  the  onsets  of  the  HRFs  for  left  and  right  areas  of  the  visual  cortex. 
In  a  motor  task,  they  find  that  while  the  delay  between  the  HRF  onsets  from  the  supple¬ 
mentary  motor  area  to  the  main  visual  cortex  is  a  constant  of  the  task,  the  corresponding 
delay  between  the  visual  cortex  and  supplementary  motor  area  scales  linearly  with  reac¬ 
tion  time.  Since  these  types  of  studies  rely  on  the  relative  onsets  of  HRFs  (potentially  for 
varying  stimuli  and/or  brain  areas),  they  clearly  depend  on  accurate  methods  for  estimat¬ 
ing  the  delay  or  latency  of  the  HRF  from  an  event  of  interest.  Two  approaches  for  this 
problem  are  found  in  Henson  et  al.  [2002]  and  Liao  et  al.  [2002]. 

Returning  to  the  problem  of  detecting  which  voxels  are  involved  in  an  fMRI  exper¬ 
iment,  we  must  discuss  the  widely  used  Statistical  Parametric  Mapping  (SPM)  software 
(Friston  [2003]).  SPM  convolves  a  canonical  HRF  with  an  on-off  timecourse  of  fMRI 
stimuli  and  performs  statistical  hypothesis  tests  comparing  the  resulting  time  series  with 
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the  observed  data.  An  advantage  of  SPM  is  that  it  produces  maps  describing  the  activation 
throughout  the  brain  in  response  to  particular  stimuli.  A  disadvantage  of  SPM  is  that  it 
is  massively  univariate,  performing  independent  statistical  tests  for  each  voxel.  In  fact,  it 
has  been  shown  that  some  mental  states  cannot  be  detected  with  univariate  techniques,  but 
require  multivariate  analysis  instead  (Kamitani  and  Tong  [2005]). 

Two  other  time  series  analysis  methods  for  detecting  activity  in  fMRI  are  Lindquist 
et  al.  [2007]  and  Faisan  et  al.  [2007].  Lindquist  et  al.  [2007]  uses  techniques  from  sta¬ 
tistical  control  theory  and  change-point  theory  to  develop  a  Hierarchical  Exponentially 
Weighted  Moving  Average  (HEWMA)  model  that  is  particularly  well-suited  to  experi¬ 
ments  on  phenomena  that  cannot  be  easily  repeated,  like  studying  the  effects  of  a  short¬ 
acting  drug.  This  model  detects  whether  a  time  series  has  shifted  away  from  its  baseline 
mean,  and  if  so,  when  the  shifts  occurred.  Faisan  et  al.  [2007]  presents  hidden  Markov 
multiple  event  sequence  models  (HMMESMs).  HMMESMs  use  data  that  has  been  pre- 
processed  into  a  series  of  spikes  for  each  voxel,  which  are  candidates  for  association  with 
hemodynamic  events.  HMMESMs  use  an  optimized,  but  fixed  activation  lag. 

The  field  of  Machine  Learning  has  also  had  an  impact  on  fMRI  analysis.  The  use  of 
classification  techniques  from  the  field  of  machine  learning  has  become  widespread  in  the 
fMRI  domain  (see  Haynes  and  Rees  [2006]  for  an  overview).  Classifiers  have  been  used 
to  predict  group  membership  for  particular  participants  (e.g.  drug-addict  vs.  control,  in 
Zhang  et  al.  [2005]),  and  a  variety  of  mental  states  (Mitchell  et  al.  [2008],  Palatucci  and 
Mitchell  [2007],  Kamitani  and  Tong  [2005],  Mitchell  et  al.  [2004],  Cox  and  Savoy  [2003], 
Haxby  et  al.  [2001]).  These  classifiers  assume  that  the  mental  processes  being  classified  do 
not  overlap  in  time,  and  that  their  timings  are  fully  known  during  both  training  and  testing. 
Kay  et  al.  [2008]  describes  work  on  the  identification  of  natural  images  by  estimating  a 
receptive  field  model  for  voxels  in  the  visual  cortex.  After  the  model  is  trained  on  one  set 
of  images,  brain  activity  is  predicted  for  a  novel  set  of  test  images  and  the  image  with  the 
highest  correlation  with  each  observed  trial  is  selected. 

Dynamic  Bayesian  Networks  (DBNs)  are  another  class  of  models  widely  used  in  ma¬ 
chine  learning  for  time  series  analysis.  The  historical  record  of  DBNs  can  be  traced  back 
to  Dean  and  Kanazawa  [1988]  and  Dean  and  Wellman  [1991],  with  more  recent  contri¬ 
butions  in  Murphy  [2002].  Examples  of  DBNs  include  Hidden  Markov  models  (HMMs) 
(Rabiner  [1989])  and  factorial  Hidden  Markov  Models  (fHMMs)  (Ghahramani  and  Jordan 
[1997]),  which  are  particularly  relevant  to  Hidden  Process  Models. 

Finally,  this  thesis  is  partially  based  on  work  on  HPMs  reported  in  Hutchinson  et  al. 
[2006]  and  Hutchinson  et  al.  [2009].  Portions  of  Hutchinson  et  al.  [2009]  are  expanded 
upon  in  Niculescu  [2005]  and  Niculescu  et  al.  [2006]  which  use  a  simple  version  of  HPMs 
as  a  case  study  for  learning  Bayesian  networks  with  parameter  constraints.  In  particular, 
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Niculescu’s  work  showed  that  clustering  neighboring  voxels  into  groups  that  share  param¬ 
eters  can  improve  HPM  performance. 


1.4  Thesis  Statement 

The  thesis  of  the  research  reported  here  is  that  we  can  develop  machine  learning  techniques 
for  time  series  data  that  can  simultaneously  estimate  parameters  of  temporal  signatures  and 
the  onsets  of  the  latent  processes  that  generate  them.  These  techniques  can  be  improved 
by  incorporating  domain  knowledge,  and  they  can  be  used  to  express,  learn,  and  compare 
competing  models  of  the  processes  generating  the  data.  This  thesis  is  motivated  by,  and 
will  be  explored  within  the  fMRI  domain. 

To  support  this  thesis,  we  provide  results  on  synthetic  and  real  fMRI  data  using  Hidden 
Process  Models.  The  synthetic  data  results  show  that  HPMs  can  recover  the  parameters  of 
the  model  used  to  generate  the  data  in  an  ideal  situation,  and  that  we  can  use  held-out  data 
log-likelihood  to  identify  the  model  with  the  correct  number  of  latent  processes.  The  real 
data  results  indicate  that  the  standard  HPM  parameterization  and  algorithms  can  overfit 
when  faced  with  the  complexity  of  fMRI  data,  but  that  extensions  incorporating  domain 
knowledge,  like  the  temporal  smoothness  of  the  process  response  signatures,  can  overcome 
this  problem.  We  also  demonstrate  how  HPMs  trained  on  fMRI  data  can  be  compared 
using  cross-validated  log-likelihood,  and  how  they  can  be  visualized  for  exploratory  data 
analysis. 

To  our  knowledge,  HPMs  are  the  first  probabilistic  model  for  fMRI  data  that  can  es¬ 
timate  the  hemodynamic  response  for  overlapping  mental  processes  with  unknown  onset 
while  simultaneously  estimating  a  distribution  over  the  timing  of  the  processes.  We  hope 
that  this  research  can  help  pave  the  way  for  a  new  paradigm  of  fMRI  experiments  that  can 
explore  sets  of  cognitive  processes  more  complex  than  those  that  directly  correspond  to 
stimuli. 

The  rest  of  this  document  is  organized  as  follows.  In  Chapter  2,  we  introduce  Hidden 
Process  Models,  including  the  formalism  for  the  model  along  with  inference  and  learning 
algorithms.  In  Chapter  3,  we  present  three  extensions  to  the  basic  HPM  presented  in  Chap¬ 
ter  2.  In  Chapter  4,  we  detail  the  correspondance  between  HPMs  and  Dynamic  Bayesian 
Networks  (DBNs),  including  indications  of  how  one  could  adapt  DBN  algorithms  for  the 
specifics  of  HPMs.  Chapter  5  presents  experimental  results  on  the  performance  of  a  vari¬ 
ety  of  HPMs  from  Chapters  2  and  3  on  real  and  synthetic  fMRI  datasets.  Finally,  Chapter 
6  discusses  the  results  from  Chapter  5,  suggests  directions  for  future  work,  and  concludes. 
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Chapter  2 

Hidden  Process  Models:  Formalism  and 
Algorithms 


In  this  chapter,  we  present  the  basic  Hidden  Process  Model.  We  begin  with  a  formal 
statement  of  Hidden  Process  Models.  We  then  present  algorithms  for  doing  inference  with 
HPMs,  and  algorithms  for  learning  the  HPM  parameters  in  fully  observable  and  partially 
observable  settings. 


2.1  Formalism 

A  Hidden  Process  Model  is  a  description  of  a  probability  distribution  over  a  time  series, 
represented  in  terms  of  a  set  of  processes,  and  a  specification  of  their  instantiations.  HPMs 
assume  the  observed  time  series  data  is  generated  probabilistically  by  a  latent  collection 
of  process  instances.  Each  process  instance  is  active  during  some  interval  of  time,  and 
influences  the  observed  data  only  during  this  interval.  Process  instances  inherit  parameters 
from  general  process  descriptions.  The  timing  of  a  process  instance  depends  on  timing 
parameters  of  the  general  process  it  instantiates,  plus  a  fixed  timing  landmark  specific  to 
the  process  instance.  If  multiple  process  instances  are  simultaneously  active  at  any  point 
in  time,  then  their  contributions  sum  linearly  to  determine  their  joint  influence  on  the 
observed  data.  Figure  2.1  depicts  an  example  of  an  HPM  for  synthetic  fMRI  data.  We  also 
supply  a  table  of  notation  (Table  A.l)  to  aid  the  reader. 
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Processes  of 
the  HPM: 


Process  1:  ReadSentence 

Process  2:  ViewPicture 

Response  signature  W: 

Response  signature  W: 

vl 

Vl 

'  \) 

v  o 

v2 

Duration  d:  11  sec. 

vz 

Duration  d:  11  sec. 

Offsets  Q:  {0,1} 

Offsets  Q:  {0,1} 

P(Q):  {00,0,} 

P(Q):  {00,0,} 

Input 

stimuli: 


One 

configuration  c 
of  process 
instances 
{  i-p  ®2>  }  ■ 

Predicted  mean: 

□  vl 

■  v2 


Figure  2.1:  Example  HPM  for  synthetic  2-process,  2-voxel  data.  Each  process  has  a 
duration  d,  possible  offsets  O  and  their  probabilities  0,  and  a  response  signature  W  over 
time  (on  the  horizontal  axis)  and  space  (voxels  vl  and  v2).  They  are  instantiated  4  times 
in  the  configuration  c.  The  start  time  of  a  process  instance  i  is  its  landmark  A  plus  its  offset 
o.  The  predicted  mean  of  the  data  is  the  sum  of  the  contributions  of  each  process  instance 
at  each  time  point.  (The  response  signatures  are  contrived  rather  than  realistic  in  order  to 
more  clearly  show  linearity.) 
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More  formally,  we  consider  the  problem  setting  in  which  we  are  given  observed  data 
Y,  a  T  x  V  matrix  consisting  of  V  time  series,  each  of  length  T.  For  example,  these  may 
be  the  time  series  of  fMRI  activation  at  V  different  voxels  in  the  brain.  The  observed  data 
Y  is  assumed  to  be  generated  nondeterministically  by  some  system.  We  use  an  HPM  to 
model  this  system.  Let  us  begin  by  defining  processes: 

Definition  1  A  process  n  is  a  tuple  (W.  0,  0,  d).  d  is  a  scalar  called  the  duration  of  it, 
which  specifies  the  length  of  the  interval  during  which  i r  is  active.  W  is  a  d  x  V  matrix 
called  the  response  signature  of  n,  which  specifies  the  influence  of  n  on  the  observed 
data  at  each  of  d  time  points,  in  each  of  the  V  observed  time  series.  0  is  a  vector  of 
parameters  that  defines  a  multinomial  distribution  over  a  discrete-valued  random  variable 
which  governs  the  timing  of  instances  of  tt,  and  which  takes  on  values  from  the  set  Q  of 
integers.  The  set  of  all  processes  is  denoted  by  II. 

We  will  use  the  notation  0(7r)  to  refer  to  the  parameter  Q  for  a  particular  process  n. 
More  generally,  we  adopt  the  convention  that  f(x)  refers  to  the  parameter  /  affiliated  with 
entity  x. 

Each  process  represents  a  general  procedure  which  may  be  instantiated  multiple  times 
over  the  time  series.  For  example,  in  the  sentence-picture  fMRI  study  described  above,  we 
can  hypothesize  general  cognitive  processes  such  as  ReadSentence,  ViewPicture,  and  De¬ 
cide.  While  it  is  reasonable  to  assume  that  the  ReadSentence  and  ViewPicture  processes 
begin  exactly  when  their  corresponding  stimuli  are  presented  (in  this  case,  at  images  1 
and  17  in  the  time  series),  we  must  treat  the  onset  time  for  Decide  with  less  certainty. 
A  reasonable  assumption  might  be  that  Decide  begins  within  4  seconds  after  the  second 
stimulus  presentation.  To  reflect  these  assumptions,  we  can  define  the  processes  ReadSen¬ 
tence  and  ViewPicture  as  both  having  their  sets  of  possible  onsets  relative  to  the  stimulus 
presentation  as  f I  =  {0},  and  Decide  as  having  Q  =  (0, 1,  2, 3, 4,  5,  6,  7}.  To  model 
the  hemodynamic  response,  we  might  choose  a  12  second  interval,  which  corresponds  to 
d  —  24  time  points.  Q(Y)  and  c/(Y)  are  modelling  choices,  and  the  values  of  these  param¬ 
eters  for  all  processes  are  specified  a  priori  in  the  HPM.  The  parameters  W(7r)  and  0(7r) 
W  can  be  specified  a  priori  or  learned  from  data. 

For  the  sentence-picture  example,  each  process  would  be  instantiated  once  for  each 
trial.  The  instantiation  of  a  process  at  a  particular  time  is  called  a  process  instance,  defined 
as  follows: 

Definition  2  A  process  instance  i  is  a  tuple  (tt,  A,  O),  where  tt  identifies  a  process  as 
defined  above,  A  is  a  known  scalar  called  a  timing  landmark  that  refers  to  a  particular 
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point  in  the  time  series,  and  O  is  an  integer  random  variable  called  the  offset,  which  takes 
on  values  in  Q(V).  The  time  at  which  process  instance  i  begins  is  defined  to  be  A  +  O. 
The  multinomial  distribution  governing  O  is  defined  by  @(7 r).  The  duration  of  i  is  given 
by  dfn),  and  the  response  signature  is  W (7r). 


The  timing  landmark  A  is  a  time  point  in  the  data,  typically  associated  with  an  event 
in  the  experiment  design.  For  example,  the  timing  landmark  for  a  process  instance  in 
the  sentence-picture  verification  task  could  be  the  time  at  which  a  particular  stimulus  is 
presented.  A(i)  specifies  roughly  when  process  instance  i  will  begin,  subject  to  some 
variability  depending  on  its  offset  Oil),  which  in  turn  depends  on  its  process  idendity  Tt(i). 
More  specifically,  the  process  instance  i  starts  at  t  =  A (i)  +  0(i),  where  0(i )  is  a  random 
variable  whose  distribution  is  a  property  of  the  process  n(i).  In  contrast,  the  particular 
value  of  0(i)  is  a  property  of  the  instance.  That  is,  while  there  may  be  some  variation  in 
the  offset  times  of  different  ReadSentence  process  instances,  we  assume  that  the  amount 
of  time  between  each  sentence  stimulus  (the  timing  landmark  A)  and  the  beginning  of  its 
corresponding  ReadSentence  process  instance  follows  the  same  distribution,  as  specified 
in  the  ReadSentence  process. 

We  consider  process  instances  that  may  generate  the  data  in  ordered  sets,  rather  than  as 
individual  entities.  This  allows  us  to  use  knowledge  of  the  experiment  design  to  constrain 
the  model.  We  refer  to  each  possible  set  of  process  instances  as  a  configuration. 


Definition  3  A  configuration  c  of  length  I  is  a  set  of  process  instances  {/’  1  . . .  /’/-},  in  which 
all  parameters  for  all  instances  ((A(i),  0(i)},  i  =  (1 . . .  I})  are  fully- specified.  The 

set  ofcdl  configurations  is  given  by  C. 

Each  configuration  is  an  assignment  of  particular  value  s  to  the  variables  {  A  (i ) ,  tt  (i ) ,  O  (i ) } 
for  some  number  of  process  instances  I.  The  purpose  of  the  set  of  configurations  C  is  to 
reduce  the  hypothesis  space  of  the  HPM  in  a  reasonable  way.  We  can  think  of  the  hy¬ 
pothesis  space  of  an  HPM  as  the  number  of  ways  that  its  processes  can  be  instantiated  to 
influence  the  data,  which  is  the  set  of  all  possible  configurations.  In  general,  if  any  pro¬ 
cess  can  be  instantiated  at  any  time,  the  hypothesis  space  can  be  very  large.  To  see  this, 
consider  that  each  process  instance  has  T  possible  landmarks,  |II|  possible  process  IDs, 
and  |0(7r)|  possible  offsets  once  7r  is  chosen.  The  choice  of  process  ID  and  offset  are  not 
independent,  but  they  are  jointly  independent  from  the  choice  of  landmark,  resulting  in 
T  |H(tt)|  possibilities  for  each  process  instance.  If  we  have  /  process  instances,  they 
can  be  combined  in  (T  |^(7r)  I ) J  ways.  The  set  C  restricts  the  hypothesis  space  of  the 
model  by  allowing  the  HPM  to  consider  a  much  smaller  number  of  possibilities.  These 
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configurations  also  facilitate  the  incorporation  of  prior  knowledge  about  the  experiment 
design. 

For  example,  in  the  sentence-picture  verification  study,  if  we  use  the  processes  Read- 
Sentence,  ViewPicture,  and  Decide  as  discussed  above,  we  can  define  the  set  of  config¬ 
urations  given  in  Table  2.1.  These  configurations  reflect  the  assumptions  that  the  Read- 
Sentence  and  ViewPicture  processes  begin  exactly  when  their  corresponding  stimuli  are 
presented  (in  this  case,  at  images  1  and  17),  that  the  first  two  process  instances  are  Read- 
Sentence  and  ViewPicture  in  either  order  (but  two  instances  of  the  same  type  are  not  al¬ 
lowed),  and  that  the  third  process  instance  is  Decide  (but  we  do  not  have  any  additional 
information  about  its  timing  beyond  the  offset  values  defined  for  Decide).  This  set  of  con¬ 
figurations  is  for  a  general  trial.  To  customize  it  for  a  particular  trial,  we  can  remove  the 
configurations  in  which  the  identities  of  the  first  two  process  instances  do  not  match  the 
sequence  of  the  stimuli  in  that  trial.  As  written,  Table  2.1  restricts  us  to  16  of  a  potential 
(54(1  +  1  +  8))3  =  1.57464  *  108  configurations  of  length  3. 
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Table  2.1:  Sample  configurations  for  a  trial  in  the  sentence-picture  verification  experiment. 
S  =  ReadSentenlce,  P  =  ViewPicture,  and  D  =  Decide.  These  configurations  reflect  the 
assumptions  that  the  S  and  P  processes  begin  exactly  when  their  corresponding  stimuli 
are  presented  (in  this  case,  at  images  1  and  17),  that  the  first  two  process  instances  are 
S  and  P  in  either  order  (but  two  instances  of  the  same  type  are  not  allowed),  and  that 
the  third  process  instance  is  D  (but  we  do  not  have  any  additional  information  about  its 
timing  beyond  the  offset  values  defined  for  Decide).  If  we  wish  to  customize  this  set  of 
configurations  to  a  specific  trial,  we  can  remove  the  configurations  in  which  the  identities 
of  the  first  two  process  instances  do  not  match  the  sequence  of  the  stimuli  in  that  trial. 


14 


We  can  now  define  Hidden  Process  Models: 


Definition  4  A  Hidden  Process  Model,  HPM,  is  a  tuple  (II,  C,  (af  . . .  cry)},  where  II  is 
a  set  of  processes,  C  is  a  set  of  configurations,  and  a2  is  the  variance  characterizing  the 
Gaussian  distribution  governing  the  Vth  time  series  of  Y.  An  HPM  defines  a  probability 
distribution  over  the  obserx’ed  data  Y  as  follows : 

P(Y|HPM)  =  ^(Y|HPM,  C  =  c)P(C  =  c|HPM)  (2.1) 

esc 

where  C  is  the  set  of  configurations  associated  with  the  HPM,  and  C  is  a  random  variable 
defined  over  C. 

The  term  P(Y|HPM,  C  =  c)  is: 

/>(Y|HPM,C  =  c)  =  f  exp  (-\{Y  -  Mc))'Sr'(y  -  Me)))  (2.2) 

(27 r)  2  1 52 1  a  V  L  J 

where  Y  and  p(c)  are  column  vectors  of  length  VT,  p(c)  represents  the  mean  of  the 
Gaussian  defined  by  configuration  c,  and  S  is  a  VT  x  VT  matrix  whose  diagonal  is  T 
repetitions  of  erf,  followed  by  T  repetitions  ofay  and  so  on  through  a'y. 

The  term  P(C  =  c|HPM)  defines  a  prior  distribution  over  the  set  of  configurations. 

In  other  words,  P(Y\HPM)  is  a  mixture  of  Gaussians,  with  each  mixture  component  gen¬ 
erated  by  a  particular  configuration.  Note  that  the  set  of  configurations  C  is  defined  as  part 
of  the  HPM  and  that  it  is  fixed  in  advance. 

To  understand  this  definition,  first  consider  a  single  configuration  c  =  {i\ . . .  ii}  and 
a  single  observed  data  point  ytv.  The  probability  distribution  over  ytv  is  defined  by  the 
Gaussian  distribution: 

P(ytv\c,  HPM )  ~  A f(ptv(c),  a2v)  (2.3) 

where  a2  is  the  variance  characterizing  the  time-independent,  voxel-dependent  noise  dis¬ 
tribution  associated  with  the  vth  time  series,  and  where 

dOW) 

A hv(c)  =  Y.  Y,  6(\(i)  +  0(i)=t-T)WTV(n(i))  (2.4) 

i£c  r= 0 

Here  <5(-)  is  an  indicator  function  whose  value  is  1  if  its  argument  is  true,  and  0  otherwise. 
W  TV  (it (i))  is  the  element  of  the  response  signature  W  associated  with  process  7r(i),  for 
data  series  v,  and  for  the  rth  time  step  in  the  interval  during  which  i  is  instantiated. 
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Equation  2.4  says  that  for  a  given  configuration  c,  the  mean  of  the  Gaussian  distribution 
governing  observed  data  point  ytv  is  the  sum  of  single  contributions  from  each  process 
instance  whose  interval  of  influence  includes  time  t.  In  particular,  the  £(•)  expression  is 
non-zero  only  when  the  start  time  (A (i)  +  ()(i))  of  process  instance  i  is  exactly  r  time 
steps  before  t,  in  which  case  we  add  the  element  of  the  response  signature  W„  M*))  at 
the  appropriate  delay  (r)  to  the  mean  at  time  t.  This  expression  captures  a  linear  system 
assumption  that  if  multiple  processes  are  simultaneously  active,  their  contributions  to  the 
data  sum  linearly.  (To  some  extent,  this  assumption  holds  for  fMRI  data  (Boynton  et  al. 
[1996])  and  is  widely  used  in  fMRI  data  analysis.) 

Alternatively,  we  can  define  //to(c)  using  a  design  matrix  created  from  c  called  Xc, 
which  is  T  x  D,  where  D  =  d{ 7r).  The  tth  row  of  Xc  is  non-zero  only  in  the  columns 

corresponding  to  the  time  points  of  the  process  response  signatures  that  influence  that  time 
point.  If  responses  overlap  temporally,  there  will  be  more  than  one  non-zero  element  in 
the  row.  If  two  process  instances  of  the  same  process  begin  at  the  same  time,  Xc  will 
be  2  for  the  time  points  where  the  two  process  instances  occur.  Essentially,  Xc  maps  the 
time  points  of  the  process  response  signatures  onto  the  time  points  of  the  observed  data 
according  to  the  configuration  c.  Using  this  approach,  we  can  write: 

/i(c)  =  XCW  (2.5) 

where  W  (D  x  V)  refers  to  the  vertical  concatentation  of  the  W (V)  W,  and  //(c)  is  T  x  V. 

The  second  term  in  Equation  2. 1  is  the  prior  distribution  over  the  set  of  configurations. 
One  choice  for  this  distribution  is: 

P(C  =  c\HPM)  oc  P(/\n(i)  AO(i)\HPM) 

i£c 

cx  P(/\n(i)\HPM)  Y\_P(0(i)\n(i),HPM) 

i£c  i£c 

<X  C(A  n(i)\HPM)  I([  0o(t)(xW) 

i£c  i£c 

(2.6) 

In  the  basic  version  of  HPMs,  we  define  the  joint  probability  P(f\iec  7r (i)  | HPM)  to  be  a 
uniform  distribution  over  all  possible  combinations  of  process  IDs. 

Thus,  the  generative  model  for  an  HPM  involves  first  choosing  a  configuration  c  G  C, 
using  the  distribution  given  by  Equation  2.6,  then  generating  values  for  each  time  point  t 
of  each  time  series  v  using  the  configuration  c  of  process  instances  and  the  distribution  for 
P(Y\HPM,  C  =  c)  given  by  Equations  2.3  and  2.4. 
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Note  that  if  T  is  large,  the  number  of  configurations  C  can  be  very  large  even  for  a 
modest  amount  of  uncertainty.  In  some  cases,  we  can  store  fewer  configurations  by  ex¬ 
ploiting  conditional  independencies  in  the  structure  of  the  data.  fMRI  data  is  grouped 
into  N  trials,  which  are  typically  designed  to  allow  all  the  hemodynamic  responses  in  one 
trial  to  decay  before  beginning  a  new  trial.  In  domains  other  than  fMRI,  the  notion  of  tri¬ 
als  translates  to  windows  of  data  separated  by  periods  during  which  no  process  instances 
are  active  under  any  configuration.  This  structure  essentially  creates  conditional  indepen¬ 
dencies  in  the  model,  since  P(Yn\c)  only  depends  on  the  process  instances  of  c  that  are 
active  during  trial  n.  These  process  instances  can  be  separated  into  N  independent  sets  of 
configurations  Cn.  Equation  2.6  can  be  applied  to  Cn: 

P(Cn  =  cn\HPM )  =P(f\  n(i)\HPM)  J]  0o(i)(7r(i)) 

i&Cn  i&Cn 

Then  the  probability  of  a  configuration  over  all  trials  is: 

N 

P{c  =  {C! . . .  cn}\Y,HPM)  =  J]  P(Cn  =  cn|Yn,  HPM)  (2.7) 

n=l 

We  will  use  this  notion  of  trial- structured  data  to  explain  the  inference  algorithm  below, 
but  in  general  HPMs  can  describe  data  with  or  without  this  structure. 


2.2  HPM  Algorithms 

In  this  section  we  present  algorithms  for  doing  inference  and  learning  in  HPMs.  The  basic 
inference  problem  in  HPMs  is  to  infer  the  posterior  distribution  P(C  =  c\  Y,  HPM )  over 
the  configurations  C,  given  the  HPM  and  observed  data  Y.  The  learning  problem  in  HPMs 
is  to  leam  maximum  likelihood  estimates  of  the  HPM  parameters,  given  an  observed  data 
sequence  Y  and  a  set  of  configurations  C.  Let  T  =  (0(7r),  W (7t),  cr^}\/n,  Vv  be  the  set  of 
HPM  parameters  to  be  learned. 


2.2.1  Inference 


We  can  infer  the  posterior  distribution  P{C  =  c\  Y,  HPM )  over  the  configurations  C,  given 
the  HPM  and  observed  data  Y  with  a  straightforward  application  of  Bayes  theorem.  First, 
the  P(Cn  =  cn|Yn,  HPM)  within  trial  n  is: 


P(Cn  =  cn\Yn,HPM) 


P(Yn\Cn  =  cn,  HPM)P{Cn  =  cn\HPM) 
Ec;eCn  P(Yn\Cn  =  dn,  HPM)P(Cn  =  dn\HPM) 


(2.8) 
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where  Yn  is  the  data  for  trial  n,  and  where  the  terms  in  this  expression  can  be  obtained 
using  Equations  2.3,  2.4,  and  2.6.  Over  multiple  trials  we  have: 

N 

P(C  =  {Cl . . .  cn}\Y,HPM)  =  P(Cn  =  cn\Yn,HPM)  (2.9) 

n—  1 

Given  the  posterior  probabilities  over  the  configurations,  we  can  easily  compute  marginal 
probabilities  to  answer  more  general  questions.  For  instance,  we  can  compute  the  probabil¬ 
ity  of  the  identity  of  a  process  instance  by  summing  the  probabilities  of  the  configurations 
in  which  the  process  instance  in  question  takes  on  the  identity  of  interest.  More  concretely, 
suppose  we  wish  to  know  the  probability  that  the  second  process  instance  i2  in  trial  4  has 
identity  S.  This  quantity  is  given  by: 

P(C4(7T(i2))  =  S)  =  Y,  <Kc(M*2))  =  S))P(C4  =  c\Y4,HPM)  (2.10) 

where  again,  <5(-)  is  an  indicator  function  that  returns  1  if  and  only  if  its  argument  is  true. 

Note  that  other  marginal  probabilities  can  be  obtained  similarly  from  the  posterior 
distribution,  such  as  the  probabilities  of  particular  offsets  for  the  process  instances,  or  the 
joint  probability  of  two  process  instances  having  a  particular  pair  of  identities. 


2.2.2  Learning  from  Fully  Observed  Data 


First  consider  the  case  in  which  the  true  configuration  of  process  instances  is  fully  ob¬ 
served  in  advance  (i.e.  there  is  only  one  configuration  in  the  HPM).  For  example,  in  our 
sentence-picture  verification  experiment,  we  might  assume  that  there  are  only  two  pro¬ 
cesses,  ReadSentence  and  ViewPicture,  that  each  process  begins  exactly  when  its  corre¬ 
sponding  stimulus  does  (f2  =  {0}),  and  that  the  observed  sequence  of  stimuli  corresponds 
exactly  to  the  sequence  of  process  instance  IDs,  resulting  in  a  single  configuration. 

In  such  fully  observable  settings  the  problem  of  learning  @(7r)  reduces  to  a  simple 
maximum  likelihood  estimate  of  multinomial  parameters  from  observed  data: 


©n(vr) 


Eigc^W  =  7T  AO(i)  =  o) 
Eiec.o'enwo)  6 M*)  =  77  A  °(*)  =  °0 


(2.11) 


Alternatively,  we  can  use  an  m-estimate  (Mitchell  [1997])  to  update  the  estimate  of 

0(tt): 

Eiec^CO  =  vr  A  0(i)  =  o)]  +  mp 


®oM 


Eieco'enOrfl)  W*)  =  vr  A  0(i)  =  o')}  +  m 


(2.12) 
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Here  p  is  the  prior  on  0o(7r)  (e.g.  p  =  )  and  the  parameter  m  controls  the  weight  of 

the  prior  as  compared  to  the  training  data.  This  method  avoids  setting  some  of  the  0(7r) 
parameters  to  0  when  their  corresponding  offsets  are  not  observed  in  the  training  data, 
since  we  may  wish  to  allow  that  offset  with  non-zero  probability  in  the  test  set. 

The  problem  of  learning  the  response  signatures  W (7r)  is  slightly  more  complex,  be¬ 
cause  the  W(-7r)  terms  from  multiple  process  instances  can  jointly  influence  the  observed 
data  at  each  time  point  (see  equation  (2.4)).  Solving  for  W  reduces  to  solving  a  multiple 
linear  regression  problem  to  find  a  least  squares  solution.  Our  multiple  linear  regression 
approach  in  this  case  is  based  on  the  General  Linear  Model  approach  for  estimating  hemo¬ 
dynamic  responses  in  fMRI  data  as  described  in  Dale  [1999]. 

Since  there  is  only  one  configuration  c  of  process  instances  in  this  setting,  it  is  simple 
to  create  Xc  as  described  above  (see  Equation  2.5).  We  define  Xc  to  be  a  VT  x  VD 
matrix  whose  block-diagonal  is  the  T  x  D  matrix  Xc  repeated  V  times.  (The  off-diagonal 
blocks  are  all  zero,  since  they  would  essentially  map  the  process  response  signatures  for 
one  voxel  onto  the  data  for  another  voxel.)  A  summary  of  the  notation  used  in  this  section 
is  in  Table  A. 2. 

The  linear  system  we  are  modeling  is  then: 


Y  =  XckL  +  e 


(2.13) 


where  e  ~  Ar(0,  a2)  and  a2  refers  to  a  VT  x  1  vector  where  a2  is  tiled  appropriately. 
Minimizing  the  objective  function: 


fo(W)  =  \\Y-XJYWU 

=  (y  -  XeWys-^y  -  xcw) 


(2.14) 


where  X  has  a2  on  the  diagonal,  gives: 

W  <—  (Xe'E^Xe^Xe'E^y  (2.15) 

One  complication  that  arises  is  that  the  regression  problem  can  be  ill  posed  if  the 
training  data  does  not  exhibit  sufficient  diversity  in  the  relative  onset  times  of  different 
process  instances.  For  example,  if  processes  A  and  B  always  occur  simultaneously  with 
the  same  onset  times,  then  it  is  impossible  to  distinguish  their  relative  contributions  to 
the  observed  data.  In  cases  where  the  problem  involves  such  singularities,  we  use  the 
Moore-Penrose  pseudoinverse  to  solve  the  regression  problem. 
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Having  computed  the  maximum  likelihood  estimates  for  W,  it  is  easy  to  find  the  max¬ 
imum  likelihood  solution  for  the  a2u\ 


at 


T 


](ytv  P'tvi.c)') 


t= 1 


(2.16) 


where  /t(c)  =  XCW". 

2.2.3  Learning  from  Partially  Observed  Data 

In  the  more  general  case,  we  may  not  know  which  of  the  set  of  configurations  C  is 
correct,  and  we  face  a  problem  of  learning  from  incomplete  data.  The  configurations 
can  have  different  numbers  of  process  instances,  and  different  values  for  the  parame¬ 
ters  (7T (i),  A (i),  0(i)}Vi.  In  trial- structured  data,  the  configurations  C„  can  have  different 
lengths  for  different  n  if  we  have  more  uncertainty  about  some  trials  than  others.  If  we 
have  no  knowledge  of  the  experiment  design  at  all,  we  can  list  all  possible  combinations 
of  process  instances.  If  we  do  have  some  knowledge,  it  can  be  incorporated  into  the  set  of 
configurations  as  discussed  above. 

In  the  presence  of  uncertainty  over  configurations,  we  use  an  Expectation-Maximization 
(EM)  algorithm  (Dempster  et  al.  [1977])  to  estimate  the  parameters  T.  Let  us  call  the 
latent  variable  of  interest  Z,  a  binary  vector  of  length  C  indicating  which  of  the  configu¬ 
rations  is  correct,  where  J2CZC  =  1.  The  EM  algorithm  locally  maximizes  the  expected 
log-likelihood  of  the  complete  (observed  and  unobserved)  data,  which  we  call  Q: 

TOM)  —  -U^Y^old  [hiP(Y,Z|tt)]  (2.17) 

The  EM  algorithm  finds  parameters  T  that  locally  maximize  the  Q  function  by  iterating 
over  the  following  two  steps  until  the  change  in  Q  from  one  iteration  to  the  next  drops 
below  a  threshold,  or  until  a  maximum  number  of  iterations  is  reached: 

E  step:  Solve  for  Ez\Y}Vm  [Z],  where  EZ\Y ^oid[Zc]  =  P(C  =  c|Y,  vpold)  is  the  prob¬ 
ability  of  configuration  c  given  Y  and  'Told.  The  solution  to  this  is  given  by  Equation 
(2.9). 

M  step:  Use  Ez |Y,^°id  [Z\  from  the  E  step  to  obtain  new  parameter  estimates  for  T  that 
maximize  Q  (Equation  2.17). 

Below,  we  use  E[Z]  for  E Z\Y  ^0\a[Z]  to  simplify  notation.  We  additionally  introduce 
a  design  matrix  covering  all  configurations  X,  and  corresponding  parameters  Y,  Z,  and 
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X.  Let  M  =  V  Yhn  TnCn  be  the  number  of  rows  in  the  design  matrix  X  representing  all 
possible  configurations  for  all  trials.  The  other  variables  in  our  linear  system  need  to  be 
tiled  or  repeated  in  order  to  match  the  size  of  this  design  matrix,  X. 

X  is  constructed  in  the  following  way.  First  we  create  Xn  for  all  trials  n  by  vertically 
concatenating  the  Xnc  as  defined  above  for  all  c  G  n.  Then  X  is  the  vertical  concatenation 
of  the  Xn  for  all  n.  Xnc  is  VTn  x  VD,  Xn  is  VTnCn  x  VD,  and  X  is  M  x  VD. 

Y  is  the  M  x  1  vector  of  data,  created  by  vertically  concatenating  the  copies  of  Yn  for 
every  configuration  in  trial  n,  for  every  trial.  Z  is  a  M  x  1  vector  in  which  the  Znc  for 
each  configuration  is  repeated  for  the  rows  of  X  to  which  it  corresponds.  We  also  define 
Z  =  diag(Z).  Similarly,  X  is  an  M  x  M  covariance  matrix  in  which  a\  is  repeated  for 
every  element  of  M  that  refers  to  voxel  v.  a2  is  the  diagonal  of  X.  Finally,  we  define 
T  =  E-^Z]. 

Using  this  notation,  we  maximize  Q  by  minimizing: 

f0(W)  =  E[Z]\\Y-XW\\U  (2-18) 

The  solution  to  this  weighted  least  squares  problem  is: 

W* —  (x'Txj  1  X'TT  (2.19) 


The  full  derivation  for  Equation  2.19  is  given  in  Appendix  B.  Intuitively,  this  update 
simply  extends  the  least  squares  solution  for  observed  data  (Equation  2.15)  by  tiling  the 
design  matrix  so  that  it  includes  all  possible  configurations,  and  adding  weights  to  the 
optimization  problem  that  reflect  the  probability  of  each  configuration. 

Given  the  update  for  W,  the  updates  for  the  other  parameters  in  T  are: 


1  A  1  T’ 


a^N 


On, 0=0  <  ,r 


N 

X 


e  [azc]  (y  ntv  l^ntv  (  Cn  )  ) 

n=  1  n  t= 1  cneCn 

T,cneCn,i£cn  SW)  =7lA  °(i)  =  o)E[znc] 


N  ^  T,c'neC,i/ec'n,o'^(i'))5W)  =  *  A  0(i)  =  o')E[zn 


(2.20) 


(2.21) 


While  this  solution  for  updating  W  is  correct,  it  is  computationally  intensive.  For 
typical  fMRI  data,  the  value  of  M  =  V  E„Cn  (the  number  of  rows  in  the  design 
matrix  for  the  entire  experiment)  can  be  very  large.  In  our  sentence-picture  verification 
example,  V  zz  5000,  and  Tn  54.  The  number  of  configurations  per  trial  Cn  reflects 
our  uncertainty  about  the  processes  occuring  and  is  therefore  highly  variable.  For  the 
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reasonable  assumptions  represented  in  Table  2.1,  Cn  =  16  for  each  of  the  40  trials.  This 
yields  M  =  5000*40*54*16  ~  1.73*108.  Tomodela  12  second  hemodynamic  response 
for  each  of  3  processes,  we  would  use  D  =  d(n)  =  24  *  3  =  72  time  points  for  the 
response  signatures.  Therefore,  the  M  xVD  design  matrix  X  would  be  1.73  *  108  x  72. 

In  practice,  we  can  reduce  M  significantly  by  requiring  all  the  rows  of  X  to  be  unique. 
Each  row  of  X  corresponds  to  a  time  point  in  the  experiment,  and  it  is  easy  to  see  that 
multiple  configurations  can  ’share’  a  particular  explanation  of  a  time  point.  For  instance, 
the  first  2  configurations  in  Table  2.1  agree  on  the  explanations  for  time  points  1  through 
16,  and  only  begin  to  differ  at  time  point  17  on  which  time  point  of  the  D  process  influences 
the  data.  The  other  variables  in  the  least  squares  solution  for  W  can  be  scaled  similarly  to 
match  the  smaller  design  matrix. 


2.3  Conclusion 

In  this  chapter,  we  have  presented  the  basic  Hidden  Process  Model,  detailing  the  formalism 
of  the  model  and  several  algorithms.  We  showed  how  to  do  inference  over  a  set  of  configu¬ 
rations  of  process  instances  using  Bayes  rule.  We  explained  how  to  use  the  General  Linear 
Model  to  estimate  the  parameters  of  the  model  when  we  know  the  correct  configuration 
of  process  instances.  Along  with  the  supplemental  material  in  Appendix  B,  we  derived  an 
extension  of  the  GLM  to  estimate  the  HPM  parameters  under  uncertainty  about  the  correct 
configuration.  In  the  following  chapters,  we  will  explore  alternative  parameterizations  of 
the  process  response  signatures  and  process  offsets,  and  an  alternative  formalism  based  on 
Dynamic  Bayes  Networks. 
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Chapter  3 

HPM  Extensions 


In  this  chapter,  we  explore  three  extensions  to  the  basic  HPM  formalism  presented  above. 
The  first  extension  is  regularization  for  the  process  response  signatures,  which  incorpo¬ 
rates  bias  into  the  HPM  learning  algorithm  to  direct  the  parameters  toward  signatures  with 
desirable  properties  (e.g.  temporal  smoothness)  without  committing  to  a  parametric  form 
for  the  responses.  The  second  extension  models  the  process  response  signatures  using 
weights  on  a  set  of  basis  functions,  reducing  the  number  of  free  parameters  by  restricting 
the  model  to  the  pre-specified  subspace  defined  by  the  basis.  Both  of  these  extensions 
require  only  simple  modifications  to  the  learning  algorithm  presented  in  Chapter  2.  The 
third  extension  allows  the  hidden  offset  variables  to  be  continuous,  effectively  removing 
the  constraint  that  processes  must  start  on  the  temporal  grid  of  the  data  acquisition  rate. 
Relaxing  this  assumption  also  requires  we  commit  to  a  continuous  parametic  form  for  the 
process  response  signatures.  The  continuous  version  of  HPMs  is  also  sufficiently  more 
complex  than  the  HPMs  in  Chapter  2  that  we  introduce  a  sampling  algorithm  in  place  of 
the  EM  algorithm  described  above. 


3.1  Regularizing  the  Process  Response  Signatures 

The  parameterization  of  the  process  response  signatures  as  matrices  of  independently  es¬ 
timated  weights  to  be  summed  in  predicting  the  output  of  an  HPM  has  advantages  and 
disadvantages.  The  main  advantage  is  flexibility;  this  parameterization  can  capture  a  wide 
variety  of  functional  forms.  In  comparison,  a  model  that  commits  a  priori  to  a  particular 
form  of  the  response  (e.g.  a  canonical  HRF)  risks  the  possibility  that  the  form  may  be 
incorrect.  The  main  disadvantage  of  the  nonparametric  approach  is  the  large  number  of 
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parameters  to  be  estimated,  especially  as  compared  to  the  relatively  sparse  amount  of  data 
available.  In  some  experiment  designs,  this  parameterization  also  suffers  from  identifiabil- 
ity  issues,  where  there  may  be  an  infinite  number  of  solutions  to  the  parameter  estimation 
problem  that  produce  the  same  data  log-likelihood.  There  is  clearly  a  spectrum  from  most 
to  least  modeling  flexibility,  and  this  parameterization  will  often  be  too  flexible  to  produce 
interesting  results. 

One  way  to  address  this  problem  without  committing  to  a  (potentially  incorrect)  para¬ 
metric  form  for  the  process  response  signatures  is  to  add  regularization  penalties  to  the 
learning  algorithm.  These  penalties  bias  the  parameter  estimates  toward  biologically  plau¬ 
sible  properties.  We  can  do  this  by  adding  a  term  g(W)  to  the  objective  function  that  penal¬ 
izes  deviations  from  these  properties.  This  penalty  term  is  weighted  by  another  parameter 
7  which  specifies  how  much  importance  should  be  given  to  the  penalty  term  relative  to 
the  optimization  term  involving  the  data.  The  general  form  of  the  objective  function  we 
minimized  to  derive  the  algorithms  above  (from  Appendix  B),  with  regularization  penalty 
9(W),  is: 


fo(W)  =  E[Z\\\Y  -±W\\U+19{W) 

=  (Y  -  XWyE^ElZJiY  -  XW)  +  7 g(W) 


(3.1) 


High  values  for  7  increase  the  bias  toward  results  with  low  penalty  terms;  low  values  for 
7  allow  the  fit  to  the  data  to  be  more  influential. 

Notice  that  since  the  we  are  simply  adding  the  regularization  term  to  the  objective  func¬ 
tion  we  had  before,  we  can  simply  add  the  derivative  of  the  penalty  term  to  the  derivative 
we  had  before,  set  the  new  derivative  equal  to  0,  and  solve  for  W : 


dfo 

dW 


-2X'Y  Y  +  2X'YXIR  +  7 


dg 

dW 


(3.2) 


For  the  fMRI  domain,  there  are  several  biologically  plausible  properties  for  which  we 
may  want  to  regularize.  The  rest  of  this  section  suggests  penalty  terms  to  regularize  for 
temporal  smoothness,  spatial  smoothness,  spatial  sparsity,  and  spatial  priors.  In  each  case, 
we  will  write  penalties  in  terms  of  the  matrices  W(7r),  each  of  which  is  d{ 7r)  x  V,  or 
time  by  space.  We  will  continue  to  vectorize  these  matrices  for  the  parameter  estimation 
formulae  above,  but  the  penalties  are  easier  to  interpret  in  terms  of  the  individual  matrices. 
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3.1.1  Temporal  Smoothness 


Since  the  HRF  reflects  a  measurement  of  blood  flow  dynamics  in  the  brain,  we  would  like 
our  estimates  to  be  temporally  smooth.  To  impose  this  constraint  on  the  W(7r),  we  can 
use  a  squared  C2  penalty  on  the  differences  between  successive  time  points  for  each  voxel 
and  each  process: 

V  d(7r)— 1 

si  (wo  =  EE  E  W'W'  -  wM>,i)2  (3-3) 

7rElI  j= 1  i=  1 


The  derivative  with  respect  to  W  is: 

dgi 


dMV{p)v,a 


=  2(W (p)v,a  ~  w (p)v,a-i)  -  2(W(p)„,a+1  -  W(p)„,a) 
=  4W (p)Vta  -  2W (p)v,a- 1  -  2W (p)v,a+ 1 


(3.4) 


for  1  <  a  <  d(p )  (the  boundary  cases  each  have  only  one  of  the  a  +  1  or  a  —  1  terms). 


3.1.2  Spatial  Smoothness 


Since  voxels  that  are  close  to  each  other  will  have  similar  blood  flow  dynamics,  we  would 
like  the  estimates  of  the  HRFs  in  these  voxels  to  be  similar.  To  penalize  deviations  from 
spatial  smoothness,  we  first  assume  we  have  a  binary  adjacency  matrix  A  that  is  V  x  V, 
where  A (i,j)  =  1  if  and  only  if  voxel  i  is  adjacent  to  voxel  j  for  some  definition  of 
’adjacent.’  Suppose  we  wish  to  impose  spatial  smoothness  on  the  estimates  of  W,  meaning 
that  if  2  voxels  are  adjacent,  we  wish  to  penalize  deviations  between  the  timecourses  of 
their  process  response  signatures.  Again  we  can  use  a  squared  C2  norm,  but  this  time  the 
norm  is  over  differences  between  spatially  adjacent  timecourses: 

V-l  V  d(ir) 

MW)  =  E  E  E  A<!.3)  E<WM<.‘  -  wMi.*)2  (3-5) 

7tGI1  i=  1  j=i+ 1  k= 1 


The  derivative  with  respect  to  W  is: 

,  V  v-l 

92  =  2  £  A(vJ)(W(P)v,a-W(p)j,a)-2j2Mkv)(W(p)i,a-W(p)v^ 

j=v-\- 1  i=  1 

(3.6) 


dW(p)Vy 


for  1  <  v  <  V  (again  the  boundary  cases  have  fewer  terms). 
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3.1.3  Spatial  Sparsity 


By  spatial  sparsity,  we  mean  that  we  want  a  small  number  of  voxels  to  be  active  for  any 
given  process,  but  we  do  not  want  the  time  course  of  the  active  voxels  to  be  sparse.  That 
is,  we  want  a  sparsity  penalty  on  the  columns  of  W  but  not  on  the  rows.  One  way  to 
achieve  this  penalty  is  with  a  (2,l)-norm  (Argyriou  et  al.  [2008]  uses  this  norm  for  a 
similar  problem  in  multi-task  learning): 


v 


g3(w )  =  ££.\, 
ten  i= 1  \ 


d(n) 

£ 

j= 1 


W  (it)2 


h3  1 


(3.7) 


The  derivative  with  respect  to  W  is: 


dg3 

dW(p)v,a 


w(p).,Sj»(/E*!w(p)y 


Eg  w(p); 


(3.8) 


where 

sgn(x)  = 


(3.9) 


3.1.4  Spatial  Priors 

In  some  cases,  we  may  have  prior  knowledge  about  the  regions  of  the  brain  that  will  be 
most  active  during  a  particular  cognitive  process.  For  example,  we  expect  that  visual 
stimuli  will  produce  a  response  in  the  visual  cortex,  but  minimal  activity  in  other  parts  of 
the  brain.  In  this  situation,  we  can  encode  our  prior  knowledge  in  a  W/,r'or .  We  can  then 
use  regularization  to  penalize  deviations  from  our  expectations,  again  with  a  squared  Co 
norm: 

g*(w) -TV  -  wMsfy  (3.io) 

7rElI  i— 1  j= 1 


dg 4 

dW  (p)v,a 


2(w  {P)v,a  -  w  (p)^-) 


(3.11) 
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3.1.5  Hybrid  Penalties 

Any  of  the  previous  penalties  can  be  combined  with  arbitrary  weights  to  introduce  multiple 
biases  simultaneously: 


g(W)  =  i9i{W) 


dg 

dW 


Edgt 


i 


(3.12) 


where  the  values  of  the  7 ,  reflect  the  relative  importance  of  each  penalty. 

In  Chapter  5,  we  present  experimental  results  showing  that  regularizing  the  process 
weights  for  spatial  and  temporal  smoothness  improves  upon  the  performance  of  the  unreg¬ 
ularized  models. 

While  these  regularization  schemes  avoid  committing  to  particular  forms  of  the  re¬ 
sponse  signatures,  they  still  require  a  large  number  of  parameters.  In  addition,  another 
step  must  be  added  to  the  estimation  process  to  choose  the  regularization  weights  (the  7 
values),  though  these  can  be  hand-tuned  and/or  fixed  in  advance.  In  the  next  section,  we 
describe  a  way  to  reduce  the  number  of  parameters  in  the  model  while  sacrificing  only  a 
modest  amount  of  flexibility. 


3.2  Basis  Functions  for  the  Process  Response  Signatures 

In  this  section,  we  introduce  a  parameterization  of  the  process  response  signatures  using 
a  set  of  basis  functions.  We  do  not  address  the  problem  of  learning  a  good  basis  set; 
rather,  we  assume  the  set  of  basis  functions  is  fixed  in  advance.  As  we  will  see,  this 
parameterization  can  reduce  the  number  of  parameters  to  be  estimated  and,  if  the  basis  is 
chosen  carefully,  restrict  the  model  to  a  subspace  with  desirable  properties. 

Instead  of  using  independent  parameters  for  the  time  points  of  process  response  signa¬ 
tures,  we  can  represent  the  contribution  of  each  process  to  the  observed  output  as  a  linear 
combination  of  a  set  of  basis  functions.  The  new  parameters  of  the  process  response  sig¬ 
natures  are  a  set  of  B  coefficients  for  each  voxel,  with  each  coefficient  corresponding  to 
one  of  B  basis  functions.  We  represent  the  basis  functions  themselves  in  a  pre-de fined 
matrix  H,  which  is  d  x  B,  where  is  the  value  of  basis  function  b  at  time  step  d.  We 
can  then  replicate  this  matrix  along  the  block  diagonal  of  a  new  matrix  H,  which  will 
be  VD  x  BPV,  where  P  is  the  number  of  processes.  We  represent  the  coefficients  as  a 
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BPV  x  1  vector  (5.  (Here,  we  assume  that  all  processes  have  the  same  duration  d,  which 
is  also  the  length  of  the  basis  functions.)  Then  the  VD  x  1  vector  W  can  be  written  as: 


W  =  H/3  (3.13) 

This  simple  change  translates  easily  to  the  M  step  update  derived  in  Appendix  B,  with  the 
update  for  /3  now  given  by: 


(3  <-  (H'X'TXHj-'H'X'TT  (3.14) 

There  are  two  main  advantages  to  this  reparameterization  of  the  response  signatures. 
Firstly,  by  choosing  smooth  basis  functions  we  can  impose  temporal  smoothness  on  the 
process  responses  and  predicted  means.  We  can  also  force  the  beginning  and  end  of  the 
responses  to  tend  to  zero  by  choosing  the  basis  functions  to  begin  and  end  near  zero. 
Secondly,  if  we  choose  the  number  of  basis  functions  B  to  be  smaller  than  the  number 
of  independent  time  points  d  in  W,  we  can  reduce  the  number  of  parameters  in  the  HPM. 
Where  previously  each  process  required  d  parameters  for  each  voxel  (for  each  time  point  of 
the  response),  using  basis  functions  requires  only  B  parameters  for  each  voxel  (coefficients 
for  each  basis  function).  For  example,  in  the  sentence-picture  verification  study  where  we 
acquired  2  images  per  second,  we  would  need  D  —  24  to  model  a  12  second  hemodynamic 
response  function.  If  instead  we  use  3  basis  functions  to  model  this  experiment,  we  can 
estimate  |  of  the  number  of  parameters  in  the  previous  model. 

The  obvious  remaining  question  is  how  to  choose  the  number  and  form  of  the  basis 
functions.  In  general,  choosing  the  optimal  basis  set  for  fMRI  data  is  outside  the  scope  of 
this  thesis.  The  model  can  generally  be  expected  to  perform  well  with  basis  functions  if  the 
basis  can  represent  the  true  process  response  signatures;  if  the  basis  cannot  represent  the 
true  model  then  the  nonparametric  approach  presented  above  (especially  the  regularized 
version)  may  be  a  better  option.  One  approach  to  modeling  the  HRF  with  basis  functions 
is  presented  in  Hossein-Zadeh  et  al.  [2003].  That  work  suggests  choosing  a  plausible 
functional  form  for  the  HRF  with  some  number  of  parameters  (they  use  a  gamma  function 
with  two  parameters).  The  HRF  is  then  instantiated  many  times,  for  varying  settings  of  the 
parameters  within  some  reasonable  range.  They  construct  a  P  x  N  matrix  Q  consisting  of 
P  realizations  of  the  HRF  using  different  parameter  settings,  evaluated  at  N  time  points. 
They  perform  PCA  on  Q,  and  use  the  first  M  principal  components  as  basis  functions 
(they  use  M  =  3).  This  approach  satisfies  two  general  criteria  to  consider  when  choosing 
a  basis  set  for  HPMs:  that  the  basis  be  able  to  represent  a  relevant  class  of  functions  (i.e. 
believed  to  contain  the  process  response  signatures),  and  that  the  basis  be  small  enough  to 
reduce  the  number  of  parameters  in  the  model  (i.e.  fewer  than  d  basis  functions). 
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In  practice,  HPMs  parameterized  with  basis  functions  generated  using  the  Hossein- 
Zadeh  et  al.  [2003]  approach  performed  similarly  to  regularized  HPMs,  but  took  less  time 
to  train.  Experimental  results  on  these  models  are  presented  in  Chapter  5. 


3.3  Continuous  HPMs 

In  the  HPM  formalism  presented  so  far,  the  implicit  assumption  has  been  that  processes 
begin  at  time  points  on  the  temporal  grid  defined  by  the  data  acquisition  rate.  This  con¬ 
straint  results  from  the  discrete  parameterization  of  both  the  offset  values  and  the  process 
response  signatures.  The  time  points  of  the  process  response  signatures  are  assumed  to  cor¬ 
respond  to  time  points  in  the  discrete  data,  and  the  offsets  are  integers  added  to  landmarks 
that  are  also  defined  on  the  temporal  grid  of  the  data.  Landmarks  that  occur  between  data 
points  must  be  rounded  to  the  nearest  data  point  to  align  the  process  response  signatures 
correctly.  In  fMRI,  the  assumption  that  cognitive  processes  begin  on  the  same  tempo¬ 
ral  grid  as  the  image  acquisitions  is  clearly  false.  In  this  section,  we  seek  to  eliminate 
this  assumption  and  allow  process  start  times  (and  therefore  process  offset  values)  to  vary 
continuously. 

In  the  model  presented  in  Chapter  2,  which  we  will  refer  to  as  the  ’discrete  HPM’ 
below,  the  legal  discrete  offset  values  for  each  process  tt  were  listed  in  the  vector  O  (zr) ,  and 
their  corresponding  multinomial  parameters  were  in  @(7r).  To  allow  continuous  process 
start  times,  we  must  allow  continuous  offset  values  for  the  processes.  The  offset  values  for 
a  process  could  either  be  bounded  by  a  pre-specified  closed  interval  [l(i r),  h( 7r)],  or  they 
could  be  unbounded.  In  the  first  case,  one  possible  modeling  choice  is  a  Beta  distribution 
over  that  interval,  with  parameters  a{n)  and  (3 (it)  taking  the  place  of  @(7r).  The  second 
case  is  even  more  general,  and  one  example  of  an  appropriate  probability  distribution  here 
is  a  Gaussian  with  parameters  /i(tt)  and  ct2(7t).  In  either  case,  each  process  defines  a 
distribution  P(o(i) |vr(i);  @(7r))  where  @(7r)  is  used  generally  to  denote  process-specific 
parameters  of  whatever  distribution  is  being  used  (multinomial,  Beta,  Gaussian,  etc.). 

This  reparameterization  of  offsets  requires  a  redefinition  of  the  notion  of  configura¬ 
tions.  Whereas  configurations  in  the  discrete  model  enumerated  a  process  ID,  landmark, 
and  offset  for  each  process  instance,  configurations  in  the  continuous  model  only  enu¬ 
merate  values  for  the  process  ID  and  landmark,  which  are  then  combined  with  a  hidden, 
continuous  offset  random  variable.  For  example,  the  hypothesis  space  defined  by  the  con¬ 
figurations  listed  in  Table  2.1  for  the  discrete  model  corresponds  to  the  configurations 
shown  in  Table  3.1  for  the  continuous  model.  We  write  oci  to  mean  the  hidden  offset 
variable  for  process  instance  i  under  configuration  c.  Note  that  we  have  only  two  configu- 
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rations  here  as  compared  to  16  in  the  discrete  case. 


c 

vr(ii) 

A(ii) 

0(h) 

vr(i2) 

A(i2) 

0(h) 

^(h) 

*(*3) 

0(H) 

1 

S 

1 

Oil 

P 

17 

Ol2 

D 

17 

013 

2 

P 

1 

021 

S 

17 

022 

D 

17 

023 

Table  3.1:  The  set  of  configurations  in  continuous  HPMs  that  corresponds  to  the  set  of 
discrete  HPM  configurations  in  Table  2.1.  The  oci s  are  hidden  random  variables  governed 
by  the  distributions  specified  by  their  corresponding  processes. 

These  parameterized  configurations  introduce  another  modeling  difference  between 
discrete  and  continuous  HPMs.  While  both  models  have  a  set  of  configurations  to  explic¬ 
itly  define  the  hypothesis  space  of  the  model,  there  are  some  types  of  process  ordering 
constraints  that  can  be  represented  with  the  discrete  HPM  configurations  but  not  with  the 
parameterized  continuous  HPM  configurations.  To  see  this,  consider  two  process  instances 
that  we  wish  to  constrain  to  a  particular  ordering,  even  if  their  intervals  of  legal  start  times 
(A  +  o )  overlap.  This  might  be  the  case  for  processes  like  Decide  and  PressButton  in  the 
sentence-picture  verification  example;  both  of  these  processes  should  begin  shortly  after 
the  second  stimulus  is  presented,  but  Decide  should  begin  before  PressButton.  In  con¬ 
structing  configurations  for  a  discrete  HPM,  we  enumerate  all  possible  combinations  of 
the  offsets,  and  we  can  then  eliminate  any  configurations  containing  combinations  with  an 
illegal  ordering.  On  the  other  hand,  continuous  HPMs  introduce  random  variables  for  each 
of  these  offsets,  and  they  are  drawn  independently  of  one  another,  from  the  distributions 
specified  by  their  corresponding  processes.  There  is  no  constraint  to  say  that  one  offset 
value  must  be  larger  or  smaller  than  another.  A  modeling  tradeoff  then,  is  that  continuous 
HPMs  allow  start  times  to  be  removed  from  the  temporal  grid  of  the  observations  at  the 
expense  of  the  ability  to  specify  strict  process  orderings  in  the  hypothesis  space  for  some 
cases. 

Continuous  offset  values  and  start  times  also  imply  the  need  for  continuous  process 
response  signatures,  since  we  will  need  to  be  able  to  evaluate  the  response  signatures  at 
the  times  of  the  observed  data  points,  regardless  of  how  those  time  points  relate  to  the 
continuous  start  time.  Essentially,  we  need  a  continuous  parametric  form  for  the  process 
response  signature,  which  we  will  call  h(x;  ui(ir)),  whose  contribution  to  an  observed  data 
point  at  time  t  is  h(t  —  s;cu(7r)).  Here  s  is  the  continuous  start  time  of  an  instance  of 
process  n,  and  contains  process-specific  parameters  of  h.  h(x;u>( n))  could  have 
a  pre-specified  duration  d  as  in  the  discrete  HPMOne  example  of  a  possible  form  for 
h(x;  uj ( 7T ) )  is  a  gamma  function  with  3  parameters,  where  the  parameters  can  vary  across 
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voxels  (h(x-,ou(n,v))): 


h(x ;  w(tt)) 


(gr-iexp(-g) 

P(V~  !)! 


(3.15) 


where  //,  p,  and  77  are  defined  for  each  process  and  each  voxel  ({z/(7r,  v),p{ir,  v),rj(n,  n)}). 
1/  controls  the  amplitude  of  the  response,  p  controls  the  width  of  the  peak,  and  //  controls 
the  delay  of  the  peak.  (This  is  the  functional  form  of  the  impulse  response  used  in  Boynton 
et  al.  [1996].) 


So  far,  we  have  purposefully  left  the  specific  forms  of  the  continuous  distribution  over 
offsets  and  the  process  response  signatures  unspecified  because  the  framework  of  continu¬ 
ous  HPMs  is  sufficiently  general  to  allow  for  a  variety  of  modeling  choices  in  these  areas. 
The  general  form  of  the  joint  distribution  specified  by  continuous  HPMs  can  be  written  as: 


P(0,  C,  Y)  =  P(0)P(C\0)P(Y\C,  O) 


where  O  represents  the  set  of  all  hidden  offset  variables,  C  represents  a  random  variable 
defined  over  configurations,  and  Y  is  the  observed  data. 

We  now  further  specify  that  the  individual  offset  variables  are  marginally  independent: 

p(o)=nm°«w))  (3-i6) 

ceC  iec 

where  P(oCi\n(i))  is  the  same  as  the  P(o(i)\n(i)]  @(7 r))  defined  above  for  each  process 
but  with  the  parameters  suppressed  for  brevity  (i.e.  this  could  be  a  Beta  or  Gaussian 
distribution  depending  on  modeling  choices). 

Similarly  to  discrete  HPMs,  we  specify  the  probability  of  the  configurations  as  being 
proportional  to  the  likelihood  of  their  components: 

P(C  =  c\0)  oc  P(Ai£cn(i))  ]^[  P(oCi\n(i))  (3.17) 

iGc 

As  before,  we  will  treat  the  joint  distribution  over  process  IDs  as  uniform,  essentially 
leaving  us  with: 

P(C  =  c\0)  oc  JJP(0ci|7r(i))  (3.18) 

iGc 


Finally,  the  distribution  over  the  observed  data  given  the  hidden  variables  remains 
Gaussian: 


V  N 

PMc=c,0=.)=nn 

v=l  n—  1 


T(n) 

V2nav 


T(n) 

IJ  exp 
t=  1 


lhv{ci  P) ) 


(3.19) 
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where 


(3.20) 


A Hv(c,o)  =  y,  hit  -  A (i)  -  oci-,u>(n(i),v)) 
i£c 

The  generative  model  corresponding  to  these  equations  works  as  follows.  Given  a  set 
of  parameterized  configurations  and  a  set  of  processes,  we  first  draw  values  for  the  offset 
variables  for  every  configuration.  That  is,  for  each  process  instance  i  in  each  configu¬ 
ration  c,  we  draw  oci  using  the  process  parameters  indicated  by  the  process  ID  for  this 
instance  (vr(A)).  Once  the  configurations  are  fully  instantiated  with  offset  values,  we  draw 
a  value  for  the  correct  configuration  c  from  a  multinomial  distribution  whose  parameters 
are  proportional  to  the  likelihood  of  the  offsets  in  each  configuration.  Now  that  we  have  a 
particular  configuration,  we  can  predict  the  mean  of  the  Gaussian  over  the  observed  data 
by  summing  the  process  responses  of  all  the  instances  in  c  over  time,  again  using  the  pa¬ 
rameters  indicated  in  the  configuration  for  their  process  IDs.  A  graphical  model  showing 
a  continuous  2-process  HPM  for  two  trials  is  shown  in  Figure  3.1. 
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Figure  3.1:  Graphical  model  for  a  continuous  HPM  with  two  trials  of  data  and  two  pro¬ 
cesses.  Shaded  nodes  are  observed;  square  nodes  are  discrete,  round  nodes  are  continuous. 
All  elements  of  the  configurations  except  the  oci  are  fixed  in  advance.  6  parameterizes  an 
offset  distribution,  u  parameterizes  the  process  response  signature,  a  contains  a  noise 
value  for  every  voxel,  Cn  is  a  random  variable  over  configurations  for  trial  n,  and  Yn  is 
a  T  x  V  data  matrix  for  trial  n.  The  set  of  configurations  for  each  trial  are  the  same  in 
this  example;  in  general  they  could  be  different.  In  some  cases  the  trial  subscript  n  is 
suppressed  for  simplicity  (i.e.  oci  is  really  onci  so  that  the  random  variables  for  the  offsets 
in  the  two  different  trials  are  distinguishable). 
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Regardless  of  the  specific  distribution  chosen  to  model  P(oi\n(i)]  (9 (tt ) )  and  the  spe¬ 
cific  form  of  h(x;  01(71,  v )),  the  shift  from  discrete  to  continuous  offsets  introduces  several 
intractable  integrals  into  the  EM  derivation  presented  in  Appendix  B.  We  therefore  intro¬ 
duce  a  different  algorithm  for  inference  and  learning  in  continuous  HPMs:  Metropolis 
sampling.  Mac  Kay  [2003]  and  MacKay  [1998]  provide  excellent  introductions  to  Markov 
Chain  Monte  Carlo  (MCMC)  methods,  of  which  Metropolis  sampling  is  one  example. 

The  concept  of  the  Metropolis  algorithm  is  fairly  simple.  Suppose  we  are  given  a 
joint  probability  distribution  P(Y,  X]  9)  over  observed  variables  Y  and  hidden  variables 
X.  Furthermore,  let  us  treat  any  unknown  parameters  9  the  same  as  hidden  variables 
(so  X  =  (A",  9}).  Assuming  that  we  can  evaluate  P(Y,  X)  (or  that  we  can  evaluate 
it  except  for  the  normalization  constant),  we  can  draw  samples  from  P(X\Y)  and  use 
statistics  of  these  samples  to  make  statements  about  the  values  of  the  hidden  variables 
and  unknown  parameters.  The  Metropolis  algorithm  starts  from  an  initial  value  X(t>)  and 
proposes  a  change  X(prop>  from  a  proposal  distribution  ().  For  example,  Q  might  be  a 
Gaussian  distribution  with  fixed  variance  centered  on  the  current  point  X(°\  This  new 
sample  X(vrop)  is  accepted  with  probability: 


/  P(Y,  X^r°ri )  Q  (X (°) ;  X^0^ )  \ 
V1,  P(Y,  AT°))  g(XM;A(°)) ) 


(3.21) 


If  the  proposal  distribution  Q  is  symmetric  (like  a  Gaussian),  then  Q(X(>lp  Xlf,ropy)  = 
Q(X(proT9-,  X((y>)  and  this  simplifies  to: 


a  =  min 


P(Y ,  X(prop))\ 

P(Y,X(°))  J 


(3.22) 


This  equation  says  that  if  Xiprop>  increases  the  joint  likelihood,  it  is  accepted  with  prob¬ 
ability  1.  If  it  decreases  the  joint  likelihood,  the  acceptance  probability  depends  on  the 
ratio  of  the  likelihoods  under  the  current  and  proposed  values  for  X.  If  X(prop)  is  rejected, 
X(0)  is  added  to  the  list  of  samples  again.  We  continue  to  generate  samples  X(t+l  >  in  this 
manner  to  acquire  some  number  of  samples.  This  process  defines  a  Markov  chain  that  is 
guaranteed  to  converge  to  the  true  distribution  P(X\Y)  as  t  — >  00. 

More  concretely,  the  hidden  variables  and  parameters  to  be  sampled  in  a  continuous 
HPM  are  X  =  {9(7i),oj('n,v),a-2(v),oriCi,C(;n)}\/Ti,v,n,c,i.  This  assumes  we  are  in¬ 
terested  in  learning  the  parameters;  if  the  parameters  have  already  been  estimated  or  are 
assumed  to  be  known  instead,  we  can  add  them  to  the  observed  variables  Y  instead  and 
sample  only  the  hidden  variables.  Fet  us  define  a  fully  factorized  proposal  distribution  Q 
consisting  of  an  independent  Gaussian  distribution  for  each  variable  and  parameter,  each 
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with  known  variance  (e.g.  Q(C(nYprop^)  ~  N{C{n)^  ^^(n)))-  The  Metropolis  algo¬ 
rithm  proceeds  as  follows: 

•  Initialize  X  (°). 

•  For  t  —  1  :  T 

-  Generate  Xlpropl  by  sampling  a  new  value  for  each  element  i  in  from 

that  element’s  proposal  distribution  Q(i). 

-  Use  Equation  3.16  (the  joint  likelihood)  to  compute  a  as  in  Equation  3.22  (the 
acceptance  probability  under  a  symmetric  proposal  density). 

-  If  a  =  1,  set  JW  =  X^r°P\ 

-  Otherwise,  set  =  X (Prop)  with  probability  a  and  X(J)  =  X(l~l>  with 
probability  1  —  a. 

•  Disregard  the  first  S  samples  (for  which  the  chain  has  not  yet  converged  to  the  true 
distribution). 

•  Compute  statistics  of  interest  (like  mean  and  variance)  about  the  hidden  variables 
and/or  parameters  from  the  remaining  T  —  S  samples. 

The  values  of  T  and  S  depend  on  problem  size. 

The  advantages  of  continuous  HPMs  are  that  they  relax  an  assumption  made  in  dis¬ 
crete  HPMs  that  is  obviously  wrong:  that  process  start  times  must  coincide  with  the  data 
acquisition  rate.  The  disadvantages  of  this  model,  in  addition  to  the  inability  to  model 
some  kinds  of  process  orderings  as  discussed  above,  are  the  difficulties  associated  with 
Metropolis  sampling.  While  Metropolis  sampling  is  guaranteed  to  converge  to  the  true 
distribution  as  /  — >  oo,  convergence  can  be  slow  because  the  hidden  state  space  is  ex¬ 
plored  with  a  random  walk,  and  it  can  be  hard  to  determine  whether  or  not  the  chain  has 
converged.  Convergence  can  also  be  sensitive  to  the  proposal  distribution  (Q( X)  above), 
the  distribution  over  the  latent  variables  and  parameters  ( P(X )  above),  and  the  initial 
values.  Furthermore,  since  the  samples  are  correlated,  we  need  to  collect  many  samples 
before  they  can  be  treated  as  effectively  independent. 

In  practice,  we  have  only  achieved  convergence  with  this  algorithm  on  synthetic  data 
for  a  toy  problem  (2  voxels).  Even  on  the  toy  problem,  convergence  only  occurred  when 
the  algorithm  was  initialized  with  parameters  equal  to  (or  very  nearly  equal  to)  the  true 
values  and  the  proposal  distribution  was  sharply  peaked  around  the  true  values.  Since  we 
do  not  expect  to  be  able  to  recreate  these  conditions  with  real  fMRI  data,  we  have  not 
applied  MCMC  to  the  sentence-picture  study. 
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3.4  Summary 


The  goal  of  the  extensions  in  this  chapter  is  to  provide  a  variety  of  modeling  choices  for 
HPMs.  We  showed  how  to  add  regularization  to  the  learning  algorithm  to  bias  the  process 
response  signature  parameters  toward  several  desirable  properties.  We  presented  an  alter¬ 
native  parameterization  of  the  process  response  signatures  using  a  set  of  basis  functions 
that  can  reduce  the  number  of  parameters  in  the  model  while  restricting  the  responses  to  a 
general  class  of  smooth  functions.  Finally,  we  introduced  continuous  HPMs  which  allow 
the  hidden  processes  to  begin  at  any  time,  regardless  of  whether  that  time  corresponds  to 
an  observed  data  point. 

All  of  the  extensions  in  this  chapter  retained  the  notion  of  configurations,  which  allows 
extreme  flexibility  in  specifying  the  hypothesis  space  of  the  model.  In  the  next  chapter,  we 
will  examine  a  version  of  HPMs  that  does  not  use  configurations. 
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Chapter  4 

A  Dynamic  Bayesian  Network 
Formulation  of  HPMs 


The  formalism  for  Hidden  Process  Models  presented  above  allows  a  great  deal  of  flexibil¬ 
ity  in  terms  of  the  hypothesis  space  of  the  model.  By  explicitly  listing  configurations  in 
the  HPM,  the  model  can  specify  the  number  of  process  instances,  the  order  of  process  in¬ 
stances,  and  interactions  between  the  start  times  of  process  instances,  simply  by  including 
or  not  including  configurations  that  satisfy  various  constraints  in  the  model.  The  cost  of 
this  flexibility  is  an  inference  algorithm  whose  complexity  grows  with  the  number  of  con¬ 
figurations,  since  every  configuration  must  be  considered  explicitly  to  determine  which 
one  is  most  likely.  If  we  have  significant  prior  knowledge  about  how  process  instances 
should  be  arranged  for  a  particular  data  set,  then  we  can  take  advantage  of  that  knowledge 
with  a  small  number  of  configurations.  However,  as  our  uncertainty  about  the  data  set 
grows,  we  experience  a  computational  explosion  in  the  number  of  configurations. 

In  this  chapter,  we  explore  a  fundamentally  different  way  to  encode  prior  knowledge 
in  HPMs.  We  present  an  alternative  modeling  framework  for  HPMs  that  does  not  use 
configurations  using  Dynamic  Bayesian  Networks  (DBNs)  (Murphy  [2002]),  discuss  the 
differences  between  this  model  and  the  previous  ones.  We  also  present  algorithms  for 
doing  Bayesian  inference  and  learning  in  this  framework.  We  will  refer  to  the  new  model 
as  an  HPM-DBN,  to  distinguish  it  from  the  HPMs  presented  above. 
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4.1  Writing  HPMs  as  DBNs 


In  this  section,  we  describe  how  to  write  a  Hidden  Process  Model  as  a  Dynamic  Bayesian 
Network  (DBN)  (Murphy  [2002]).  Factorial  Hidden  Markov  Models  (fHMMs,  Ghahra- 
mani  and  Jordan  [1997])  are  a  subclass  of  DBNs  that  can  express  many  of  the  assumptions 
of  HPMs  if  we  constrain  them  properly.  (An  introduction  to  fHMMs  is  provided  in  Ap¬ 
pendix  C  for  the  unfamiliar  reader.)  In  HPM-DBNs,  we  will  have  a  Markov  chain  for  every 
process  we  wish  to  model,  but  with  restricted  dynamics  to  reflect  our  assumptions  about 
processes.  Similarly  to  discrete  HPMs,  HPM-DBNs  will  define  a  Gaussian  probability  dis¬ 
tribution  over  the  observed  data  points,  the  mean  of  which  will  be  a  sum  of  contributions 
from  each  process-chain.  We  will  also  introduce  input  variables  for  each  process-chain 
to  model  landmarks.  We  now  proceed  to  the  specifics  and  notations  of  HPM-DBNs.  A 
diagram  of  an  example  HPM-DBN  is  provided  in  Figure  4. 1  to  supplement  the  text. 
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Figure  4.1:  Example  HPM-DBN.  Triangles  represent  binary  input  nodes,  squares  repre¬ 
sent  discrete  process  nodes,  circles  represent  continuous  output  nodes.  Shaded  nodes  are 
observed.  Each  process  has  an  arrow  from  its  corresponding  input  type  for  each  possible 
offset  in  fl  Notation  is  provided  for  time  slice  t  —  3. 
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4.1.1  Variables  of  HPM-DBNs 


To  write  an  HPM  as  a  DBN,  we  create  a  chain  of  discrete  state  variables  like  a  fHMM  for 
each  process  in  the  HPM.  Following  the  notation  of  Ghahramani  and  Jordan  [1997],  we 
will  write  St  to  mean  an  indicator  vector  holding  the  value  of  the  chain  corresponding  to 
process  it  at  time  t.  Each  state  variable  takes  on  values  from  0  to  d(ir),  the  duration  of  the 
process  to  which  it  corresponds,  so  has  length  d(n)  +  1.  The  dynamics  of  HPM-DBNs 
are  designed  so  that  a  process-chain  starts  at  0,  transitions  to  1  when  the  process  begins, 
and  deterministically  counts  to  its  duration  before  transitioning  back  to  0.  In  Figure  4.1, 
these  state  variables  are  the  square  unshaded  nodes. 

HPM-DBNs  also  have  a  layer  of  input  nodes  that  are  not  typically  present  in  fHMMs, 
represented  by  the  shaded  triangles  in  Figure  4.1.  These  nodes  contain  information  about 
the  timing  landmarks  in  the  experiment.  Each  chain  has  a  corresponding  binary  input 
value  at  every  time  step  called  Z;77'1  that  is  1  if  a  landmark  for  process  tt  occurred  at  time 
t  and  0  otherwise.  While  these  nodes  may  contain  information  about  different  kinds  of 
landmarks  for  the  same  process  (e.g.  second  stimulus  presentation  and  button  press)  we 
assume  that  the  distribution  governing  the  process  start  time  in  relation  to  the  landmark 
time  is  invariant  with  respect  to  the  type  of  landmark  and  treat  all  landmarks  for  a  given 
process  the  same.  The  set  of  nodes  influenced  by  a  particular  input  node  depends  on  the 
number  of  offsets  allowed  for  the  corresponding  process  (f2(7r));  there  is  a  line  from  i]"  '  to 
SiH  for  every  value  of  o  E  D(V).  HPM-DBNs  assume  that  the  set  of  input  nodes  {7^^} 

that  influences  S\7V)  are  either  all  zero  or  contain  only  a  single  one.  These  constraints  reflect 
the  assumption  that  only  a  single  instance  of  a  process  can  be  occurring  at  a  given  time. 

Finally,  an  HPM-DBN  contains  a  continuous  observed  variable  (the  shaded  circles  in 
Figure  4.1)  representing  the  data  at  each  time  step  called  Yt.  {>'}  ( {  F  }  =  (Ft}Vt)  does 
not  have  its  own  temporal  dynamics,  but  instead  depends  on  the  values  of  the  chains  at 
time  t,  which  we  will  write  {,5)},  as  shorthand  for  { S[ ' f ,  . . . ,  ,5'/<n)  }. 

4.1.2  Parameters  and  Distributions  of  HPM-DBNs 

The  set  of  parameters  T  of  an  HPM-DBN  contains  a  vector  of  parameters  called  that 
define  a  distribution  over  the  state  values  at  t  —  1  for  each  chain,  transition  matrices 
called  defining  the  temporal  dynamics  of  each  chain,  weight  matrices  called  lF*7r) 
that  contain  the  contributions  of  that  chain  to  the  emission  distribution  over  the  observed 
nodes  for  each  chain,  and  noise  variables  called  a2v  for  the  emission  distribution  for  each 
dimension  of  the  data  (v  =  [1, . , . ,  V ]).  We  now  discuss  these  parameters  in  more  detail, 
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along  with  the  distributions  they  define. 

First,  consider  the  probability  distribution  over  ,S'|7r)  for  some  chain  n,  which  depends 
on  /j7^,  P(yS[7T}\l[“'>).  We  define  =  {b^\  b^},  where  the  subscript  contains  the  value 
of  I{  .  With  these  vectors,  we  can  write: 


iogP(s<’r||/!'1) 


'd(  7r)+l 

n 


(i-/W) 


OES- 

0,k ) 


O) 

1  ,k 


'  d(n)+l 


rO) 


n  k 


M 

1  ,k 


k= 1 


k= 1 


(i  -  iPxsPyiogbP  +  ii*\sPy\ogbP 


(4.1) 


These  equations  describe  a  mixture  of  multinomial  distributions  over  the  values  of  S^\ 

(it) 

where  I±  selects  which  multinomial  to  use.  (Note  that  it  would  also  be  reasonable  to  fix 
=  0  and  l[A!  =  0  for  all  n  and  only  deal  with  the  transition  matrices  A(7T) .) 

The  probability  distribution  over  depends  on  S^\  and  some  set  of  the  input  vari¬ 
ables  (4-o(7r)}’  where  we  assume  that  if  some  value  of  {t  —  Q(V)}  lies  outside  the  legal 
range  of  time  steps  (t  e  [1,  T])  it  is  removed  from  the  set.  This  distribution  is  parameter¬ 
ized  by  a  set  of  transition  matrices  =  {A@  \  A^},  Vo  e  0(7r).  Each  transition  matrix 
in  this  set  covers  a  different  setting  of  the  input  variables  on  which  depends. 

A^  is  the  transition  matrix  for  the  case  where  all  of  the  are  0;  that  is,  none  of  the 

landmarks  for  process  n  occurred  during  the  legal  offset  window.  A0  is  the  transition 
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(n) 

matrix  for  the  case  in  which  It_0  =  1.  The  full  distribution  is  then  written  as: 


a(7r)+l  <i(7r)+l 

n  n 

1=1  j= i 


i-E 


o£f2(7r)  1t  —  0 
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oEQ(  7r) 
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o,iJ 1 


iogp(s«wisS,{/,(:E)})  = 


i-  E  44  (s;:i)' logos'”  + 

oeO(7r)  J 

E  (4H(s!iir  log A<’>si”' 

OG^(7r) 


(4.2) 


As  mentioned  above,  the  desired  dynamics  for  each  chain  are  largely  known  in  advance 
for  HPM-DBNs.  While  we  have  uncertainty  about  when  the  process  should  begin  (i.e.  at 
which  time  step  the  chain  should  go  from  0  to  1),  once  it  begins  it  must  count  to  its  duration 
and  return  to  0.  This  means  that  is  fully  deterministic;  since  it  reflects  the  case  where 
none  of  the  input  variables  are  one,  chain  tt  may  not  transition  from  0  to  1  (we  do  not 
allow  processes  to  begin  spontaneously)  and  must  continue  counting  if  its  value  is  already 
above  1.  (Once  a  chain  begins  counting,  we  ignore  the  input  values  until  it  resets  to  zero 
again.  This  is  consistent  with  our  assumption  that  only  one  instance  of  a  process  is  active 
at  a  time.)  In  each  transition  matrix  corresponding  to  an  offset  value  o,  there  is  only  1  free 
parameter  (e.g.  in  A^\  writing  A<^T\S^\,  S^}),  the  only  unknown  element  is  21^(0, 1) 
since  A^\ 0,  0)  =  1  —  A^\o,  1)).  This  free  parameter  closely  corresponds  to  0o(vr)  in 
discrete  HPMs.  In  that  case,  a  process  was  forced  to  occur  following  its  landmark  since 
tt)  @o(tt)  =  1.  We  can  either  remove  this  constraint  in  HPM-DBNs  and  let  the  free 

parameters  of  the  be  independent,  or  we  can  retain  this  constraint  by  renormalizing 
the  free  parameters.  For  instance,  for  O(zr)  =  [0, 1, 2],  ©(zr)  =  [0.1,  0.4,  0.5],  and  =  1, 
Aj°(0, 1)  =  0.1,  4^(0, 1)  =  0.5,  and  A^\ 0, 1)  =  1. 

The  emission  distribution  for  this  model  is  exactly  the  one  used  for  HPMs  in  Chapter 
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2: 


Yt\{St}  ~  N(J2(WW)'S?\  E) 
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Here,  is  a  V  x  1  vector,  {S'*}  represents  the  collection  of  |II|  state  variables  at  time  t, 
is  a  (D(7r)  +  l)xV  matrix  containing  the  response  signature  parameters  for  process 
7 r  augmented  with  a  row  of  zeros  for  St  =  0,  and  £  is  a  V  x  V  diagonal  covariance 
matrix  containing  the  a2  variables. 

Finally,  let  us  write  down  the  full  log-likelihood  of  the  HPM-DBN  model: 
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4.2 


Modeling  Differences  and  Extensions 


HPM-DBNs  are  not  equivalent  to  the  HPMs  presented  in  Chapter  2,  nor  the  extensions 
presented  in  Chapter  3.  The  earlier  models  based  on  configurations  allowed  the  explicit 
inclusion  or  exclusion  of  specific  arrangements  of  processes.  The  removal  of  configu¬ 
rations  from  the  model  entails  an  arguably  more  elegant,  but  technically  less  expressive 
distribution  over  the  hidden  variables  describing  the  underlying  processes. 

For  instance,  the  HPMs  discussed  in  earlier  chapters  allowed  multiple  overlapping 
process  instances  of  the  same  type.  HPM-DBNs  do  not  allow  this  phenomenon.  One 
might  think  it  would  be  a  simple  extension,  in  that  the  state  of  a  chain  could  be  the  sum  of 
the  values  for  each  process  instance.  The  difficulty  is  in  disambiguating  the  contribution 
of  each  instance.  For  example,  if  the  state  value  of  a  chain  at  a  particular  time  is  7,  there 
are  several  possible  interpretations.  This  node  might  represent  a  single  process  that  began 
7  time  steps  ago,  or  it  might  represent  two  instances  that  began  2  and  5  time  steps  ago,  or 
two  instances  that  began  3  and  4  time  steps  ago,  or  three  instances  that  began  1  and  2  and 
4  time  steps  ago,  etc. 

Another  difference  between  configuration-based  HPMs  and  HPM-DBNs  is  that  con¬ 
figurations  easily  accomodate  process  ordering  constraints,  whereas  HPM-DBNs  as  dis¬ 
cussed  so  far  do  not.  By  choosing  the  set  of  configurations  carefully,  constraints  on  pro¬ 
cess  ordering  are  encoded  by  simply  including  configurations  that  respect  the  constraint 
and  excluding  those  that  do  not.  In  HPM-DBNs  as  discussed  so  far,  this  type  of  constraint 
is  not  present.  The  start  times  of  the  process  chains  are  treated  independently,  and  all 
possibilities  are  considered. 

Process  ordering  constraints  can  be  reintroduced  into  HPM-DBNs  by  adding  arrows 
from  one  chain  to  another,  connecting  time  steps  t  of  the  first  chain  to  time  steps  t  +  5  of 
another,  where  5  is  the  required  temporal  delay  between  the  processes.  An  example  of  this 
is  shown  in  Figure  4.2,  where  in  this  case  we  want  the  Decide  process  to  begin  before  the 
PressButton  process.  We  do  not  need  any  new  parameters  to  achieve  this  constraint;  we 
simply  change  the  expression  for  P(S^\S^1,  {/^L  )})  slightly  to  also  depend  on 
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where  n'  is  the  chain  on  which  7r  depends: 


'  d{ 7r)+l  d{ 7r)+l 

n  n 

1=1  j= i 

/ 


l-5(5t(:'i>0)Eoen{.,4-)o 
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n 


f  d(ir)-\-l  d( 7r)+l 


n  n 

i= 1  3= 1 


V 


J 


1  -  s(siU  >  o)  £  tl  (Sii\y  log  + 

o€fl(7r)  J 

s(s$>  o)  E  (4:UsS)' logos''*) 

OG^(7r) 

(4.5) 


where  the  delta  function  d(-)  returns  1  if  its  argument  is  true  and  0  otherwise.  This  equation 
simply  defines  the  conditions  under  which  each  of  the  transition  matrices  is  to  be  used.  If 
d(S'{yj  >  0)1™  is  1,  then  it  is  true  that  n'  has  begun  and  l)V0  is  1,  so  the  appropriate 

transition  matrix  is  /io  ]  ■  If  1  —  8  (S^J  >  0)  520It-o ’s  1>  then  it  is  true  that  either  n' 
has  not  yet  started  (its  value  is  still  0)  and/or  none  of  the  input  values  in  the  legal  window 
t  —  O(V)  are  1.  In  this  case,  the  conditions  for  the  start  of  process  7r  are  not  met,  and  we 
use  A^\ 
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In  designing  a  model  with  process  ordering  constraints,  one  must  be  aware  of  the 
possibility  that  a  ill-designed  constraints  could  potentially  disallow  a  process  instance. 
For  instance,  if  the  start  time  windows  for  two  processes  are  identical,  and  we  constrain 
the  model  such  that  one  must  begin  strictly  before  the  other,  then  choosing  the  last  possible 
start  time  for  the  first  process  precludes  the  second  process  from  occuring  at  all,  since  all 
of  its  possible  start  times  violate  the  ordering  constraint.  In  some  cases  this  consequence 
may  be  acceptable;  in  other  cases  it  may  be  undesirable.  We  will  leave  this  concern  to  the 
model  designer. 


4.3  Algorithms  for  HPM-DBNs 

HPM-DBNs  without  process  ordering  constraints  can  be  treated  just  like  fHMMs  for  do¬ 
ing  inference  and  learning,  since  the  addition  of  an  observed  input  layer  and  restricting 
the  transition  dynamics  does  not  add  any  complexity  to  the  model.  Appendix  C  discusses 
these  algorithms.  Below,  we  discuss  algorithms  for  HPM-DBNs  with  process  ordering 
constraints.  Introducing  dependencies  between  process  chains  in  the  model  to  encode 
these  constraints  creates  couplings  between  the  chains  beyond  just  the  shared  responsibil¬ 
ity  for  the  output  variable.  As  a  result,  we  can  no  longer  use  standard  fHMM  algorithms  to 
do  inference  and  learning  in  this  model.  Appendix  E  develops  a  variational  approximation 
for  HPM-DBNs  with  process  ordering  constraints  based  on  the  structured  approximation 
for  fHMMs.  The  rest  of  this  chapter  discusses  Markov  Chain  Monte  Carlo  (MCMC) 
sampling  algorithms  (MacKay  [1998,  2003])  to  do  Bayesian  inference  and  learning  in 
HPM-DBNs.  Note  that  the  Bayesian  inference  approach  is  not  specific  to  the  HPM-DBN 
representation;  we  could  train  HPM-DBNs  using  maximum  likelihood  techniques  (with 
or  without  regularization),  and/or  we  could  develop  Bayesian  inference  techniques  for  the 
HPMs  of  Chapter  2  as  well. 


4.4  A  MCMC  Sampling  Algorithm  for  HPM-DBNs 

The  inference  problem  in  HPM-DBNs  is  to  recover  the  sequence  of  hidden  states  {5'(7r')} 
for  each  chain  7r  given  the  observed  inputs  {/},  the  observed  outputs  {Y },  and  the  param¬ 
eters  of  the  model  T.  The  learning  problem  is  to  estimate  the  parameters  of  the  model  T 
given  the  observed  inputs  {/}  and  the  observed  outputs  {Y}. 

Gibbs  sampling  is  a  Markov  Chain  Monte  Carlo  (MCMC)  method  that  addresses  both 
inference  and  learning  (MacKay  [2003,  1998]).  MCMC  methods  iteratively  generate  sam- 
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pies  of  a  hidden  variable  x.  The  sample  at  iteration  t  called  {x(t)}  depends  on  the  sample 
at  iteration  t  —  1  ({x^~  1 }}),  and  the  sequence  of  iterations  defines  a  Markov  chain.  As 
t  — >  oo,  {x(l'1}  tends  to  the  true  distribution  P(x).  In  Gibbs  sampling,  each  dimension 
Xi  of  the  hidden  variable  is  sampled  from  its  full  conditional  distribution  given  all  other 
dimensions. 

To  do  Gibbs  sampling  for  learning  in  HPM-DBNs,  the  hidden  variables  to  be  sampled 
are  {S'}  and  T.  We  can  think  of  the  dimensions  of  these  hidden  variables  as  a  grouping 
of  the  components  that  can  be  sampled  individually  at  each  iteration,  conditioned  on  ev¬ 
erything  else.  In  our  case,  we  sample  the  values  of  each  chain  in  turn  ({S'^}  one  7r  at  a 
time),  and  then  sample  the  values  of  the  parameters  of  the  model: 

1.  Initialize  values  for  {S1}^  and  Tl0k 

2.  For  iter  —  1  :  N 

(a)  For  7T  =  1  :  IT: 

i.  Sample  }(««■)  from  P({5,(’r)}|{/},  {S^'^}^ter~ i),  |F}, 

(b)  Sample  from  |{J},  }(««■),  {y}). 

If  the  parameters  are  known,  we  can  do  inference  by  just  sampling  the  chains  and  not 
updating  the  parameters  (i.e.  omitting  step  2b). 

Below,  we  examine  each  step  in  greater  detail.  Section  4.4. 1  adapts  a  sampling  method 
presented  in  Scott  [2002]  for  Hidden  Markov  Models  (HMMs)  similar  to  the  forward- 
backward  algorithm  to  sample  each  chain  while  holding  the  other  chains  and  parameters 
fixed.  Section  4.4.2  gives  the  full  conditional  distributions  for  each  parameter  to  be  sam¬ 
pled. 

4.4.1  Sampling  Chains  in  HPM-DBNs 

In  Step  2(a)i  above,  we  wish  to  sample  {S'W}  from  its  full  conditional  distribution,  given 
all  other  variables  in  the  model:  {5'0rV7r)}0ter-1))  {!'},  i) )  can 

accomplish  this  with  a  simple  modification  to  an  existing  algorithm  for  sampling  the  hid¬ 
den  states  of  a  Hidden  Markov  Model  (HMM)  (Scott  [2002]). 

The  method  described  in  Scott  [2002]  is  based  on  the  forward-backward  algorithm  for 
HMMs  (Rabiner  [1989]).  The  forward-backward  algorithm  efficiently  computes  the  prob¬ 
ability  distribution  over  the  hidden  variables  in  an  HMM  (the  unobserved  state  sequence) 
given  the  observed  data  and  parameters.  Let  us  call  the  hidden  states  of  the  HMM  St  and 
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the  data  Yt,  for  time  steps  t  =  [1, . . . ,  T],  and  let  the  set  of  all  parameters  of  the  HMM  be 
\l/.  The  forward  pass  computes  matrices  Ft  for  t  =  [2, . . . ,  T]  where  Ft(r,  s )  oc  P(St- 1  = 
r,  St  =  s,  Yt\Y1:t_i,  'k),  where  Y\,t  denotes  the  sequence  { YT }  where  r  =  [1, . . t].  The 
backward  pass  computes  matrices  Bt  where  Bt(r,  s)  oc  P(St- 1  =  r,  St  =  s\Yi:T,  'k).  Pro¬ 
portionality  for  each  matrix  is  reconciled  by  Ft(r,  s)  =  1  and  ^ s  Bt(r,  s )  = 

1.  Note  that  this  normalization  means  that  Ft(r,  s)  =  P(St_ i  =  r,  St  =  s|Y1:t,  T). 

A  method  for  recursively  sampling  the  states  of  an  HMM  on  the  backward  pass  is 
presented  in  Scott  [2002].  The  stochastic  backward  recursion  begins  by  sampling  a  value 
for  St  from  the  last  result  of  the  forward  pass,  summed  over  all  possible  state  values  for 
St-i-  That  is,  St  is  sampled  from  Y^r  Fx(r, :)  =  -P(<Sr,  St- i  =  r\Yi:T,  vk).  Given  a 

sampled  value  St,  St- 1  is  recursively  sampled  from  the  appropriate  column  of  Ft.  That  is, 
we  draw  St_ i  from  Ft(:,  St). 

The  extension  to  HPM-DBNs  is  straightforward.  We  consider  three  cases:  those  that 
influence  another  chain,  those  that  are  influenced  by  another  chain,  and  those  that  are  only 
coupled  to  other  chains  through  the  output  variables. 


Case  1:  sampling  a  chain  without  direct  links  to  other  chains 


For  a  chain  without  direct  links  to  other  chains,  which  we  will  call  chain  n  —  1,  the 
forward  pass  computes  P(Yi:t,  J(P)  recursively  for  all  t,  and  the  backward  pass 

samples  S^ 1 }  from  P(S^1'>\Yi:T,  \k)  recursively  for  all  /.  More  specifically,  the  forward 
recursion  constructs  (d(  1)  +  1)  x  (d(  1)  +  1)  matrices  Ft  oc  P(S^1,  S^,  Yt\Y1:t_i,  'k)  for 
t  =  [2, . . . ,  T]  as  follows: 


Fj(r,s)  a  P(sf  =  >t)/>(Vi|s[1)  =  r,  {s!^1’},  <P)  x 

P(sy  =  sis!11  =  r,  {/«},  >p)p(yi|s<1)  =  5,  {S^1*}.  * ) 

F.(r,  s )  a  P(S!L\  =  r| y,*-,,  {/“>},  <t)P(S,m  =  s |S«  =  r,  {/<"},  <I>) 

=  EfF.-ife  r)]P(Sf  =  S|SS  =  r,  {/<»},  A<») 

<? 

p(yt|st(1)  =  s,{stwl)},iy,E) 

(4.6) 


where  proportionality  is  reconciled  with  Ft(r,  s)  =  1.  All  of  these  distributions 

are  defined  in  the  probabilistic  model  described  in  Section  4.1.  The  stochastic  backward 
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pass  samples  a  value  from  the  distribution  specified  by  the  appropriate  column  of  the  Ft 
matrix  as  follows  (the  7  indicates  a  sampled  value): 

Sp  ~  £Ft(>V) 

r 

sh  ~  f.i-.s;1’) 

(4.7) 


Case  2:  sampling  a  chain  with  outgoing  links  to  another  chain 

Now  consider  chain  n  =  2  with  links  from  its  nodes  at  times  t  to  the  nodes  of  a  different 
chain  (let  us  call  this  chain  tt  =  3)  at  t  +  5.  Since  Gibbs  sampling  assumes  the  values  of 
all  variables  not  being  sampled  in  a  given  step  are  fixed  and  known,  we  can  treat  the  fixed, 
known  values  {S’®}  essentially  as  additional  observations  in  the  model,  as  we  did  with 
{/}  and  { Y }  in  Case  1.  Now  the  forward  recursion  becomes: 

Ft(r,  s)  cx  r)]F>(S<2)  =  s|S<?,  =  r,  {/<2>},  A^) 

Q 

p(r,|s,(2)  =  s,  w,  z)p(s!%  is!%  s,(2),  {/<=»},  a(3() 

(4.8) 

with  appropriate  adjustments  for  end  effects,  and  with  proportionality  again  reconciled 
with  r  Ss  Ft(r,  s)  =  1.  The  backward  sampling  is  the  same  once  the  Ft  are  con¬ 
structed. 


Case  3:  sampling  a  chain  with  incoming  links  from  another  chain 


Chain  n  =  3  described  above  has  incoming  links  from  chain  n  =  2.  This  again  requires 
a  slight  change  in  the  construction  of  the  Ft,  but  again  the  change  is  simple  since  we 
can  treat  the  values  of  {/S'^2-) }  as  fully  observed  when  sampling  { S(:V> } .  Here  the  forward 
matrices  are  constructed  as  follows: 


F,(r,s)  oc  5][Ft_1(9,r)]P(Sf)=s|St(!)1=r,sS{/(3|},4l,3)) 
q 

Finish  =  S,{Si'*l>},W,Z) 


(4.9) 


Once  more,  we  must  normalize  such  that  J2r  Ft(r,  s)  =  1  and  adjust  for  end  effects, 
after  which  the  backward  recursion  is  the  same. 
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4.4.2  Updating  the  HPM-DBN  Parameters 


Having  sampled  values  for  the  hidden  states  {S'}  on  a  given  iteration  of  the  Gibbs  sampling 
procedure,  the  next  step  is  to  sample  new  values  for  the  parameters  T.  Recall  that  the 
distribution  to  be  sampled  from  is  the  full  conditional  distribution  of  'b  given  everything 
else  in  the  model: 


P(tf|{I,S,Y»  oc  P({/,P,Y}|vb)P(vb) 


(4.10) 


The  likelihood  P({I,  S,  Y}|\b)  is  defined  in  Section  4.1.  Let  us  define  the  priors  over  the 
components  of  'b  to  be  a  priori  independent: 


pm 


log  P(\b) 


ii 

mn  P(bM)P(AW)P(W{7r)) 

7T— 1 

n 

p(s)  n  p(bP)p{bP)p(wM)p(A$>) 

71=1 

n 

logP(S)  +  J2[logP(bP)  +  log  P{b™) 

7T— 1 

log  P(A{’))+  J2  1°15^(4'))] 

o£f2(7r) 


n  p<4')) 

OEQ(7r) 

+  log  P(W^)  + 


(4.11) 


So  for  some  normalizing  constant  c  we  have: 
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n 

iogF(t|{/,s,y})  =  c  +  ^[iogp(s|’>|/!',,4<*>)]  + 

7T—  1 

E  Eiiogp(s,('lis«:>1,  {s^m''1)]  + 

£  =  2  7T— 1 
T 

ElogF«|{S,},^S)  + 

t=l 

n 

log  P(E)  +  J>gP(^)  +  logP(&?°)  +  logP(wW)  + 

7T— 1 

logF(4'))+  E  108^(4’’)] 

OG^(7r) 

(4.12) 


For  each  element  of  T  in  turn,  we  can  choose  an  appropriate  prior  distribution  and 
derive  its  full  conditional  distribution  by  grouping  terms  not  involving  the  parameter  of  in¬ 
terest  into  the  normalizing  constant  c.  We  summarize  these  distributions  below;  Appendix 
D  provides  more  detail  on  how  each  distribution  was  derived. 

First  consider  the  parameters  b^\  Since  contains  the  parameters  of  a  multinomial 

(n) 

distribution,  we  use  the  conjugate  prior,  a  Dirichlet  distribution,  with  parameters  a()  (a 
x  1  vector).  Then  the  full  conditional  distribution  from  which  we  sample  a  new  value 
of  6(j7r)  at  each  Gibbs  iteration  is  also  Dirichlet,  with  parameters  (1  —  +  olq\  The 

parameters  b^  are  treated  similarly. 

Next  consider  the  transition  matrices.  Recall  that  each  A^1  contains  only  one  free 

in)  (n) 

parameter,  which  we  will  call  py0  .  Since  p0  is  effectively  the  parameter  of  a  Binomial 
distribution,  the  conjugate  prior  P(pi^)  is  a  Beta  distribution  with  parameters  and 
7 o  .  This  results  in  another  Beta  distribution  to  sample  at  each  Gibbs  iteration,  with 
parameters 

t= 2 


and 


El^'-i  >  o)j<7si(!i“si('),°]  +  7<r> 


t= 2 
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The  weight  matrices  containing  the  contributions  of  each  chain  to  the  output  are 
(rf(7r)  +  1)  x  V  and  we  will  sample  them  one  row  (W^})  at  a  time.  Under  a  multivariate 
Gaussian  prior  with  parameters  /j^  (a  U  x  1  vector)  and  Y!'p  (a  U  x  U  matrix),  we  obtain 
a  multivariate  Gaussian  sampling  distribution  with  parameters 


-l 


+  (E?)w 


e-1 


j.= i 

and 


t= i 


TV'^TT 


-1 


J.=  1 


Finally,  we  can  treat  the  elements  of  the  diagonal  of  E  independently,  so  we  define  an 
inverse-Gamma  prior  for  each  a'l  with  parameters  //,,  and  vv.  This  results  in  an  inverse- 
Gamma  to  sample  from,  with  parameters  \T  +  rjv  and  vv  +  |  \0^t,v  ~  IM,v)2]- 

In  summary,  this  chapter  has  presented  HPM-DBNs,  a  Dynamic  Bayes  Net  similar  to 
a  factorial  HMM.  The  assumptions  made  by  HPM-DBNs  are  similar  to  those  made  by 
the  HPMs  presented  in  the  previous  chapters,  but  as  we  have  discussed,  they  are  not  the 
same.  In  particular,  the  HPM-DBNs  we  described  cannot  accomodate  temporally  overlap¬ 
ping  instances  of  the  same  process,  whereas  discrete  HPMs  can.  Discrete  HPMs  generally 
have  more  flexibility  than  HPM-DBNs  in  restricting  the  hypothesis  space  through  config¬ 
urations.  Another  example  of  this  flexibility  is  that  discrete  HPMs  can  use  different  sets 
of  configurations  for  different  trials,  whereas  the  HPM-DBN  we  described  maintains  the 
same  structure  for  all  trials.  For  experiment  designs  and  models  where  this  flexibility  is 
not  necessary,  the  major  difference  between  discrete  HPMs  and  HPM-DBNs  is  the  use  of 
EM  versus  sampling  to  learn  the  model  parameters.  Based  on  the  lack  of  convergence  for 
the  sampling  algorithm  for  continuous  HPMs  in  such  a  sparse,  high-dimensional  setting, 
we  anticipated  similar  problems  and  did  not  implement  the  sampling  algorithm  for  HPM- 
DBNs.  The  next  chapter  presents  experimental  results  for  the  HPMs  discussed  in  Chapters 
2  and  3. 
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Chapter  5 

Experimental  Results 


In  this  chapter,  we  present  experimental  results  using  HPMs.  We  first  demonstrate  the 
performance  of  HPMs  on  a  synthetic  dataset  for  which  we  know  our  modeling  assumptions 
are  sound  and  about  which  we  have  ground  truth.  We  then  give  an  analysis  of  the  real 
sentence-picture  fMRI  dataset  exploring  the  extensions  described  in  Chapter  3. 


5.1  Synthetic  Data 

In  this  section,  we  explore  the  performance  of  HPMs  on  synthetic  data.  The  benefit  of 
using  synthetic  data  is  the  ability  to  compare  results  to  ground  truth.  With  synthetic  data, 
we  can  determine  how  well  the  learned  process  response  signature  parameters  match  the 
true  responses  that  generated  the  data  (something  for  which  we  have  no  ground  truth  in 
the  real  data).  We  can  also  justify  the  use  of  cross-validated  data  log-likelihood  as  an 
evaluation  metric  by  showing  that  it  selects  the  model  with  the  correct  number  of  processes 
on  synthetic  data. 

Of  course,  the  drawback  of  synthetic  data  is  that  we  cannot  really  know  whether  it  is 
a  good  approximation  to  real  data.  In  fact,  since  we  generate  synthetic  data  from  a  true 
HPM  and  add  independent,  identically  distributed,  Gaussian  noise  to  each  point  in  space 
and  time,  we  expect  that  the  synthetic  data  is  significantly  easier  to  model  than  real  data, 
which  may  or  may  not  be  consistent  with  the  HPM  assumptions  and  certainly  has  more 
complex  noise  properties.  For  example,  all  of  the  synthetic  data  experiments  were  run  with 
the  standard  version  of  HPMs,  and  as  detailed  below,  the  standard  version  was  successful. 
However,  as  described  later  on,  standard  HPMs  suffered  from  overfitting  in  the  real  data 
experiments,  and  regularization  and  basis  functions  were  needed  to  overcome  the  problem. 
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5.1.1  Data  Generation 


The  synthetic  data  was  created  to  roughly  imitate  the  real  sentence-picture  experiment. 
We  created  datasets  using  two  synthetic  processes  (ViewPicture  and  ReadSentence),  three 
processes  (adding  Decide),  and  four  processes  (adding  PressButton).  For  each  experiment, 
all  of  the  voxels  responded  to  all  of  the  processes.  It  is  unlikely  that  all  the  voxels  in  the 
brain  would  respond  to  all  processes  in  an  experiment,  but  we  wished  to  test  HPMs  in  the 
most  challenging  setting  possible:  maximal  spatial-temporal  overap  among  the  HRFs  of 
different  processes. 

For  each  dataset,  the  HRF  for  each  process  was  generated  by  convolving  a  boxcar 
function  indicating  the  presence  of  the  stimulus  with  a  gamma  function  (following  Boyn¬ 
ton  et  al.  [1996])  with  parameters  (a,  r,  n}  of  the  form 


h{t)  =  a  * 


W'-'exptf) 

r(n  —  1)! 


data. 


The  parameters  for  the  gamma  functions  were  (8.22, 1.08, 3}  for  ViewPicture,  (8,  2.1, 2} 
for  ReadSentence,  and  (7.5, 1.3,  3}  for  Decide.  The  processes’  gamma  functions  were 
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convolved  with  a  4-second  boxcar,  matching  the  experiment  timeline.  These  responses  are 
shown  in  Figure  5.1. 

The  ViewPicture  and  ReadSentence  processes  had  offset  values  of  {0,1},  meaning 
their  onsets  could  be  delayed  0  or  1  images  (0  or  0.5  seconds)  from  their  corresponding 
stimuli.  The  Decide  process  had  offset  values  of  {0, 1,  2,  3, 4,  5},  meaning  its  onset  could 
be  delayed  0-5  images  (0-2.5  seconds)  from  its  corresopnding  stimulus,  which  in  each  case 
was  the  second  stimulus  presentation,  whether  it  was  a  picture  or  a  sentence.  The  synthetic 
HPMs  were  roughly  equivalent  to  models  used  in  the  real  data  experiments  (HPM-2-U, 
HPM-3-U,  and  HPM-4-U,  described  below). 


Synthetic  2-process  data 


Figure  5.2:  Two-process  synthetic  data:  a  single  voxel  timecourse  for  two  trials,  with  and 
without  noise.  The  first  trial  (the  left  half  of  the  time  series)  is  a  picture  followed  by  a 
sentence;  the  second  trial  is  the  reverse. 


To  generate  the  data  for  a  new  trial  in  the  two-process  dataset,  we  first  chose  which 
stimulus  would  come  first  (picture  or  sentence).  We  required  that  there  be  an  equal  number 
of  picture-first  and  sentence-first  trials.  When  a  picture  stimulus  occurred,  we  selected 
an  offset  from  the  ViewPicture  offsets  according  to  the  distribution  specified  by  for  that 
process  and  added  it  to  the  stimulus  onset  time  to  get  the  process  start  time.  Then  we  added 
the  ViewPicture  HRF  (over  all  voxels)  to  the  synthetic  data  beginning  at  that  start  time. 
Sentences  were  dealt  with  similarly,  where  overlapping  HRFs  summed  linearly.  Finally, 
we  added  Gaussian  noise  with  mean  0  and  standard  deviation  2.5  independently  to  every 
voxel  at  every  time  point.  The  three-process  dataset  was  generated  similarly,  except  that 
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Synthetic  3-process  data 


Figure  5.3:  Three-process  synthetic  data:  a  single  voxel  timecourse  for  two  trials,  with 
and  without  noise.  The  first  trial  (the  left  half  of  the  time  series)  is  a  picture  followed  by  a 
sentence;  the  second  trial,  which  starts  at  time  60,  is  the  reverse.  The  second  peak  in  each 
trial  is  higher  than  in  Figure  5.2  because  the  three-process  data  includes  activation  for  the 
Decide  process. 


each  trial  included  a  Decide  process  as  well,  whose  onset  was  chosen  according  to  the 
timing  distribution  of  that  process,  and  added  to  the  second  stimulus  onset  time  to  get  the 
process  start  time.  Figure  5.2  shows  the  timecourse  of  a  single  voxel  for  a  sequence  of 
two  trials  from  the  two-process  dataset  (with  and  without  noise),  and  Figure  5.3  shows  the 
same  for  the  three-process  dataset.  In  both  cases,  the  first  trial  is  a  picture  followed  by  a 
sentence;  the  second  trial  is  the  reverse.  The  same  general  process  was  used  to  generate 
four-process  data. 
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5.1.2  Estimation  of  the  Hemodynamic  Responses 

To  test  whether  HPMs  can  accurately  recover  the  true  process  response  signatures  un¬ 
derlying  the  data,  we  compared  the  learned  response  signatures  to  the  true  responses  in 
the  two  and  three-process  synthetic  datasets.  These  datasets  each  had  only  2  voxels  (the 
small  number  of  voxels  was  chosen  to  easily  show  the  learned  responses,  even  though  the 
learning  problem  is  easier  with  more  informative  voxels).  In  each  case,  we  trained  the 
model  on  40  trials  (the  number  of  trials  we  have  in  the  real  data)  using  standard  HPMs  (no 
regularization  or  basis  functions). 

In  both  datasets,  the  identity  of  the  process  instances  in  each  trial  were  provided  to 
the  learning  algorithm  via  the  configurations,  but  their  timings  were  not.  This  corresponds 
to  reasonable  assumptions  about  the  real  fMRI  data.  Since  we  know  the  sequence  of  the 
stimuli,  it  is  reasonable  to  assume  that  the  process  instances  match  that  sequence.  How¬ 
ever,  we  do  not  know  the  delay  between  the  stimulus  presentation  and  the  beginning  of  the 
cognitive  process(es)  associated  with  it.  This  is  consistent  with  the  sets  of  configurations 
used  in  the  real  data  experiments. 


Two  trials  of  synthesized  data  for  the  2-process  experiment 


Figure  5.4:  A  sequence  of  two  trials  of  synthetic  data  for  the  2-process  experiment  using 
2  voxels.  The  first  trial  (the  left  half  of  the  time  series)  is  a  picture  followed  by  a  sentence; 
the  second  trial,  starting  at  time  60,  is  the  reverse. 


A  sequence  of  two  trials  of  the  two-process  data  are  shown  in  Figure  5.4,  and  the 
learned  responses  for  each  process  in  each  voxel  are  shown  in  Figure  5.5.  Note  that  the 
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Figure  5.5:  Comparison  of  the  learned  vs.  true  process  response  signatures  in  the  two- 
process  data  for  two  voxels.  The  mean  squared  error  between  the  learned  and  true  re¬ 
sponses  averaged  over  timepoints,  processes,  and  voxels  is  0.2647. 


learned  responses  are  reasonably  smooth,  even  though  this  assumption  was  not  provided  to 
the  learner.  The  mean  squared  error  between  the  learned  and  true  responses  averaged  over 
timepoints,  processes,  and  voxels  is  0.2647,  and  the  estimated  standard  deviations  for  the 
voxels  are  2.4182  and  2.4686  (compare  with  the  true  value  of  the  standard  deviation  of  the 
noise  added  to  the  training  data,  which  was  2.5).  Figures  5.6  and  5.7  are  the  corresponding 
plots  for  the  three-process  data.  In  this  case,  the  mean  squared  error  between  the  learned 
and  true  responses  averaged  over  timepoints,  processes,  and  voxels  is  0.4427,  and  the 
estimated  standard  deviations  for  the  voxels  are  2.4316  and  2.4226.  The  parameters  of 
the  timing  distributions  for  the  processes  were  also  estimated  accurately  in  both  the  two- 
process  and  three-process  experiments.  The  mean  squared  error  averaged  over  the  10  6 
parameters  in  the  three -process  experiment  was  0.01. 
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Two  trials  of  synthesized  data  for  the  3-process  experiment 


Figure  5.6:  A  sequence  of  two  trials  of  synthetic  data  for  the  3-process  experiment  using 
2  voxels.  The  first  trial  (the  left  half  of  the  time  series)  is  a  picture  followed  by  a  sentence; 
the  second  trial,  starting  at  time  60,  is  the  reverse. 
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Figure  5.7:  Comparison  of  the  learned  vs.  true  process  response  signatures  in  the  synthetic 
three-process  data  for  two  voxels.  The  mean  squared  error  between  the  learned  and  true 
responses  averaged  over  timepoints,  processes,  and  voxels  is  0.4427. 
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5.1.3  Choosing  the  Number  of  Processes  to  Model 

Another  interesting  question  to  approach  with  synthetic  data  is  whether  the  use  of  cross- 
validated  data  log-likelihood  for  model  selection  in  HPMs  is  justified.  One  might  be  con¬ 
cerned  about  bias  toward  HPMs  with  larger  numbers  of  processes,  but  we  expect  that  the 
potential  for  overfitting  will  be  mitigated  by  the  use  of  cross-validation.  In  fact,  there  can 
be  a  slight  bias,  even  when  using  separate  test  data,  in  favor  of  models  that  allow  a  larger 
number  of  possible  configurations.  This  bias  occurs  because  we  compute  the  posterior 
distribution  over  configurations  for  each  test  trial,  meaning  that  we  can  adjust  the  mixture 
of  configurations  to  best  explain  each  test  trial.  Since  we  can  make  this  adjustment  at  test 
time,  there  can  be  a  bias  toward  more  flexible  models,  like  those  with  a  larger  number 
of  possible  configurations.  There  can  also  be  a  bias  during  training  toward  configurations 
that  cover  longer  intervals  of  time  within  the  trial.  We  attempt  to  control  this  source  of  bias 
by  filling  in  the  mean  training  trial  for  any  time  points  that  are  not  influenced  by  the  pro¬ 
cess  instances  when  we  compute  the  data  log-likelihood.  As  we  show  below,  the  impact 
of  these  potential  biases  in  our  synthetic  data  experiments  does  not  prevent  HPMs  from 
recovering  the  correct  model. 

To  investigate  this  question,  we  generated  training  sets  of  40  examples  each  using  2, 
3,  and  4  processes  in  the  same  fashion  as  the  previous  experiments.  For  each  training  set, 
we  trained  HPMs  with  2,  3,  and  4  processes  each.  Additionally,  we  generated  test  sets 
of  100  examples  each  using  2,  3,  and  4  processes.  Every  training  and  test  set  had  100 
voxels.  For  each  test  set,  we  used  each  of  the  HPMs  trained  on  its  corresponding  training 
set  to  evaluate  the  log-likelihood  of  the  test  data  under  the  model.  More  specifically,  we 
used  the  log-likelihood  of  the  test  data  under  the  configuration  with  the  highest  posterior 
probability  for  each  trial.  In  each  case,  the  model  with  the  best  score  was  the  one  with  the 
correct  number  of  processes.  We  performed  this  experiment  with  independently  generated 
training  and  test  sets  30  times  with  consistent  results.  The  average  test-set  log-likelihoods 
are  shown  in  Table  5.1,  along  with  further  repetitions  of  this  experiment  with  decreasing 
training  set  sizes.  As  expected,  we  see  increased  variability  with  smaller  training  sets,  but 
even  for  small  numbers  of  examples,  the  HPM  with  the  highest  test  data  log-likelihood  has 
the  same  number  of  processes  as  were  used  to  generate  the  data. 


5.2  Sentence-Picture  Verification  fMRI  Data 

Here  we  present  results  on  the  sentence-picture  verification  fMRI  dataset  described  above. 
Recall  that  in  this  dataset  (Keller  et  al.  [2001]),  participants  were  presented  with  a  sequence 
of  40  trials.  In  half  of  the  trials  participants  were  shown  a  picture  (involving  vertical 
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Number  of 
Training 
Trials 

Number  of 

Processes 

in  HPM 

2  Process  Data 

3  Process  Data 

4  Process  Data 

40 

2 

-5.64  ±0.00444 

-7.93  ±  0.0779 

-7.72  ±0.0715 

40 

3 

-7.47  ±0.183 

-5.66  ±0.00391 

-5.72  ±  0.00504 

40 

4 

-7.19  ±0.0776 

-5.687  ±0.00482 

-5.65  ±0.00381 

20 

2 

-2.87  ±0.204 

-3.80  ±0.192 

-3.70  ±0.606 

20 

3 

-4.00  ±0.0461 

-2.86  ±0.00597 

-2.87  ±0.00276 

20 

4 

-3.91  ±0.0319 

-2.89  ±  0.00320 

-2.85  ±0.00364 

10 

2 

-1.44  ±0.245 

-2.07  ±0.0653 

-1.96  ±0.0665 

10 

3 

-1.99  ±0.119 

-1.47  ±0.0231 

-1.47  ±0.00654 

10 

4 

-1.95  ±0.0872 

-1.49  ±0.0195 

-1.46  ±0.00427 

6 

2 

-2.87  ±0.204 

-1.32  ±0.0363 

-1.34  ±0.297 

6 

3 

-4.00  ±  0.0461 

-0.923  ±0.0130 

-0.928  ±0.0126 

6 

4 

-3.91  ±0.0319 

-0.933  ±0.0149 

-0.921  ±0.00976 

2 

2 

-3.75  ±0.00710 

-7.23  ±  0.0456 

-4.62  ±0.0383 

2 

3 

-5.36  ±0.0689 

-6.99  ±0.0823 

-3.96  ±0.0252 

2 

4 

-5.08  ±  0.0704 

-7.02  ±  0.0853 

-3.91  ±0.0241 

Table  5.1:  For  each  training  set,  the  table  shows  the  average  (over  30  runs)  test  set  log- 
likelihood  of  each  of  3  HPMs  (with  2,  3,  and  4  processes)  on  each  of  3  synthetic  data  sets 
(generated  with  2,  3,  and  4  processes).  Each  cell  is  reported  as  mean  ±  standard  deviation. 
NOTE:  All  values  in  this  table  are  *105. 


arrangements  of  the  symbols  *,  +,  and  $)  for  4  seconds  followed  by  a  blank  screen  for 
4  seconds,  followed  by  a  sentence  (e.g.  “The  star  is  above  the  plus.”)  for  4  seconds. 
Participants  could  press  buttons  indicating  whether  the  sentence  correctly  described  the 
picture  at  any  time  during  the  second  stimulus  presentation.  The  participants  then  rested 
for  15  seconds  before  the  next  trial  began.  In  the  other  half  of  the  trials  the  sentence  was 
presented  first  and  the  picture  second,  using  the  same  timing.  Figure  1.1  diagrams  the 
temporal  structure  of  the  trials. 

Imaging  was  carried  out  on  a  3.0  Tesla  G.E.  Signa  scanner.  A  T2*-weighted,  single¬ 
shot  spiral  pulse  sequence  was  used  with  TR  =  500  ms,  TE=  18  ms,  50-degree  flip  angle. 
This  sequence  allowed  us  to  acquire  8  oblique  axial  slices  every  500  ms,  with  an  in-plane 
resolution  of  3.125  millimeters  and  slice  thickness  of  3.2  mm,  resulting  in  approximately 
5000  voxels  per  participant.  The  8  slices  were  chosen  to  cover  areas  of  the  brain  believed 
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to  be  relevant  to  the  task  at  hand.  Each  participant  was  also  annotated  with  anatomical 
regions  of  interest  (ROIs).  The  data  were  preprocessed  to  remove  artifacts  due  to  head 
motion  and  signal  drift  using  the  FIASCO  program  Eddy  et  al.  [1999]. 


5.2.1  Models 

Our  experiments  compare  seven  different  HPMs,  and  compare  them  against  a  standard 
classifier.  In  this  section,  we  describe  the  details  of  each  model. 


GNB 

The  Gaussian  Naive  Bayes  classifier  (Mitchell  [1997])  used  in  our  experiments  operates 
on  the  8  second  window  following  each  stimulus  presentation.  It  considers  the  two  win¬ 
dows  per  trial  as  two  independent  examples  from  one  of  two  classes:  either  ReadSentence 
or  ViewPicture.  For  each  class,  a  mean  and  variance  are  estimated  for  each  feature  in¬ 
dependently.  In  this  case,  the  features  are  the  values  of  every  voxel  at  every  timepoint 
in  the  window  (16  images,  so  16  x  V  features,  where  V  is  the  number  of  voxels  being 
modeled).  Once  the  mean  and  variance  estimates  are  computed  from  the  training  set,  they 
are  used  to  compute  the  probability  of  a  test  example  belonging  to  one  class  or  the  other, 
again  under  the  assumption  that  all  features  are  independent.  The  class  with  the  higher 
probability  is  predicted  for  each  example,  and  the  classification  accuracy  for  a  given  fold 
is  the  percentage  of  examples  predicted  correctly. 

There  are  a  few  key  differences  between  the  GNB  approach  and  HPMs.  Firstly,  the 
GNB  considers  the  examples  corresponding  to  the  first  and  second  stimuli  of  each  trial 
in  the  test  set  independently  (i.e.  the  N  trials  become  2 N  separate  examples  instead  of 
pairs  of  examples).  Through  the  set  of  configurations,  HPMs  are  provided  with  the  prior 
knowledge  that  in  each  trial,  there  must  be  exactly  one  instance  of  ReadSentence  and 
exactly  one  of  ViewPicture.  Therefore  HPMs  cannot  make  the  mistake  of  predicting  two 
ReadSentence  instances  for  a  test  trial,  whereas  the  GNB  can.  Secondly,  the  GNB  has 
a  different  noise  model  from  HPMs.  The  GNB  estimates  a  variance  for  every  voxel  at 
every  time  point;  HPMs  estimate  a  variance  for  every  voxel,  independetly  of  time  and 
independently  of  process  (similar  to  class). 
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HPM-GNB 


The  HPM-GNB  model  approximates  the  GNB  through  the  HPM  framework.  It  has  two 
processes:  ReadSentence  and  ViewPicture.  The  duration  of  each  process  is  8  seconds  (16 
images)  so  that  they  may  not  overlap.  HPM-GNB  essentially  eliminates  the  differences 
between  HPMs  and  GNB  discussed  above.  It  has  a  variance  for  each  voxel,  and  incorpo¬ 
rates  the  knowledge  that  each  trial  has  exactly  one  instance  of  ReadSentence  and  exactly 
one  of  ViewPicture.  It  is  the  same  as  HPM-2-K  (presented  in  the  next  section)  except  that 
the  processes  are  shorter  so  that  they  may  not  overlap. 


HPM-2-K  and  HPM-2-U 

HPM-2-K  is  a  2-process  HPM  containing  just  ReadSentence  and  ViewPicture,  each  of 
which  has  a  duration  of  12  seconds  (a  reasonable  length  for  the  hemodynamic  response 
function).  The  ’K’  stands  for  ’known,’  meaning  that  the  timing  of  these  two  processes 
is  assumed  to  be  known.  More  specifically,  it  is  assumed  that  each  process  starts  exactly 
when  its  corresponding  stimulus  is  presented.  In  the  notation  of  Chapter  2,  f I  =  {0}  for 
both  processes.  For  a  given  test  set  trial,  the  configurations  for  HPM-2-K  are  shown  in 
Table  5.2.  The  configurations  for  a  training  set  trial  are  restricted  to  those  containing  the 
correct  order  of  the  processes  (based  on  the  order  of  the  stimuli). 


c 

vr(4) 

A(4) 

0(  4) 

A(4) 

0(4) 

1 

s 

1 

0 

P 

17 

0 

2 

p 

1 

0 

S 

17 

0 

Table  5.2:  Configurations  for  a  test  set  trial  under  HPM-2-K.  For  supervised  training, 
there  is  only  one  configuration  for  each  training  trial.  For  each  configuration  c,  4  is  the 
kth  process  instance,  7 r(4)  is  its  process  ID,  A(4)  is  its  landmark,  and  0(4)  is  its  offset. 

HPM-2-U  is  also  a  2-process  HPM  containing  just  ReadSentence  and  ViewPicture.  In 
this  case  though,  the  ’U’  stands  for  ’unknown,’  meaning  that  we  allow  a  small  amount  of 
uncertainty  about  the  start  times  of  the  processes.  In  HPM-2-U,  the  f)  for  each  process  is 
(0, 1},  which  translates  to  delays  of  0  or  0.5  seconds  (0  or  1  images)  following  the  relevant 
stimulus  presentation.  The  configurations  for  the  test  set  in  this  case  are  in  Table  5.3. 
Again,  during  supervised  training,  the  configurations  for  a  given  training  trial  are  limited 
to  those  that  have  the  correct  process  ordering.  Since  the  true  offset  for  the  processes  are 
unknown  even  during  training,  there  are  still  4  configurations  for  each  training  trial. 
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Table  5.3:  Configurations  for  a  test  set  trial  using  HPM-2-U.  For  supervised  training  where 
we  assume  the  order  of  the  stimuli  is  known,  there  will  be  only  four  configurations  for  each 
training  trial.  For  each  configuration  c,  4  is  the  fcth  process  instance,  7r (4)  is  its  process 
ID,  A (ik)  is  its  landmark,  and  0(4)  is  its  offset. 
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HPM-3-K  and  HPM-3-U 


HPM-3-K  and  HPM-3-U  add  a  Decide  process  to  the  previous  2  models,  also  with  a  dura¬ 
tion  of  12  seconds.  We  used  a  duration  of  12  seconds  despite  reaction  times  faster  than  that 
because  we  expected  that  the  hemodynamic  response  would  be  temporally  sluggish  com¬ 
pared  to  the  reaction  times.  In  both  cases,  the  Decide  process  has  fl  =  {0, 1,  2,  3, 4,  5,  6,  7}, 
allowing  the  process  to  start  with  a  delay  of  0  to  3.5  seconds  from  the  second  stimulus  pre¬ 
sentation.  For  these  HPMs,  we  can  use  the  information  we  have  about  the  participant’s  re¬ 
action  time  by  assuming  that  the  Decide  process  must  begin  before  the  participant  pushes 
the  button.  This  means  that  for  each  trial,  we  can  also  eliminate  any  configurations  in 
which  the  Decide  process  starts  after  the  button  press;  that  is,  any  configurations  for  which 
03  >  reaction -time.  Note  that  this  implies  that  different  trials  can  have  different  sets  of 
configurations  since  reaction  times  varied  from  trial  to  trial.  If  the  participant  did  not  press 
the  button  for  some  trial,  all  offsets  are  considered  for  the  Decide  process.  The  test  set 
configurations  for  HPM-3-K  for  a  trial  with  a  reaction  time  of  2.6  seconds  are  given  in 
Table  5.4.  Since  the  nearest  preceding  image  to  the  button  press  corresponds  to  offset  5, 
we  have  removed  configurations  with  03  G  {6,  7}.  To  do  supervised  training,  we  only  use 
the  configurations  with  the  correct  ordering  of  ReadSentence  and  ViewPicture.  The  con¬ 
figurations  for  HPM-3-U  are  the  analogous  extension  to  those  for  HPM-2-U;  that  is,  the 
configurations  in  Table  5.4  are  extended  to  allow  offsets  of  0  or  1  for  the  first  two  process 
instances. 
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Table  5.4:  Configurations  for  a  test  set  trial  under  HPM-3-K.  The  reaction  time  for  this  trial 
is  2.6  seconds,  which  corresponds  to  offset  5  for  the  Decide  process,  so  all  configurations 
with  offsets  greater  than  5  for  this  process  have  been  eliminated  for  this  trial.  For  each 
configuration  c,  ik  is  the  k\h  process  instance,  n(ik)  is  its  process  ID,  A (ik)  is  its  landmark, 
and  0(ik )  is  its  offset. 
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HPM-4-K  and  HPM-4-U 


The  final  two  rows  of  Table  5.10  show  results  for  HPMs  similar  to  HPM-3-K  and  HPM-3- 
U  called  HPM-4-K  and  HPM-4-U.  These  HPMs  include  another  process  called  PressB Lit¬ 
ton,  with  a  duration  of  12  seconds.  The  landmark  for  instances  of  this  process  is  the  time 
at  which  the  participant  pressed  the  button  rounded  to  the  nearest  preceding  image.  The 
offsets  are  fl  =  {  —  1,  0},  indicating  that  the  process  may  begin  0  or  0.5  seconds  before  the 
landmark.  If  the  participant  did  not  press  the  button  during  a  particular  trial,  this  process 
is  not  included  in  the  configurations  for  that  trial.  That  is,  we  are  treating  this  process  as 
observable  in  terms  of  its  existence  or  non-existence,  with  a  small  amount  of  uncertainty 
in  its  timing  when  it  does  exist.  Furthermore,  we  can  assume  that  Decide  must  begin  no 
later  than  PressButton,  and  remove  configurations  in  which  this  assumption  does  not  hold. 
The  configurations  for  HPM-4-K  for  a  test  trial  in  which  the  participant  pressed  the  button 
1.75  seconds  (between  offsets  3  and  4)  after  the  second  stimulus  presentation  are  shown 
in  Table  5.5.  If  there  was  no  button  press,  the  configurations  for  HPM-4-K  are  identical  to 
those  for  HPM-3-K  (and  those  for  HPM-4-U  match  the  set  for  HPM-3-U).  For  supervised 
training,  we  simply  limit  the  configurations  to  those  with  ReadSentence  and  ViewPicture 
in  the  correct  order. 
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Table  5.5:  Configurations  for  a  test  set  trial  using  HPM-4-K.  For  this  trial,  the  participant’s  reaction  time  was  1.75 
seconds,  so  the  landmark  for  PressButton  (B)  is  3  (the  nearest  preceding  image).  We  have  eliminated  configurations 
in  which  Decide  (D)  begins  after  PressButton.  For  each  configuration  c,  4  is  the  A:th  process  instance,  n (if.)  is  its 
process  ID,  A (4)  is  its  landmark,  and  0(4)  is  its  offset. 


5.2.2  Comparing  HPMs 


Given  that  we  are  studying  datasets  with  an  unknown  number  of  processes,  each  of  which 
may  have  unknown  properties,  how  can  we  compare  a  set  of  HPMs,  each  describing  a 
different  model  of  the  data,  against  each  other?  Below  we  explore  two  metrics:  accuracy 
in  predicting  aspects  of  the  dataset  that  we  do  know  (like  the  order  of  the  stimuli),  and 
the  log-likelihood  of  the  data  under  the  model.  Both  metrics  are  computed  within  the 
framework  of  cross-validation. 

For  each  metric,  we  first  report  an  experiment  using  all  the  HPMs  described  above 
trained  using  the  standard  parameterization  on  all  available  voxels  for  each  participant. 
Since  this  is  computationally  expensive,  we  then  restricted  our  experiments  to  the  top  1000 
most  active  voxels  for  each  participant  and  the  subset  of  HPMs  with  known  timing  for  the 
first  two  processes.  (It  does  seem  reasonable  to  assume  that  the  start  times  of  these  two 
processes  coincide  with  their  corresponding  stimulus  presentations.)  These  experiments 
compare  standard  HPMs,  regularized  HPMs,  and  HPMs  with  basis  functions. 

In  the  regularization  experiments,  the  temporal  smoothness  and  spatial  smoothness 
penalties  described  in  Chapter  3  were  weighted  by  a  regularization  parameter  chosen  using 
HPM-3-K  on  Participant  A.  From  the  set  of  powers  of  two  (with  exponents  from  0  to  10), 
the  value  with  the  highest  average  test  set  log-likelihood  over  baseline  was  at  5 12.  We  used 
the  temporal  smoothness  and  spatial  smoothness  penalties  to  reflect  our  prior  knowledge 
that  the  hemodynamic  response  is  based  on  blood  flow,  which  should  be  smooth  over  space 
and  time.  We  chose  not  to  use  the  spatial  sparsity  penalty  because  the  experiments  were 
already  limited  to  1000  voxels,  and  we  did  not  have  sufficient  prior  knowledge  to  use  the 
spatial  prior  penalty. 

The  HPMs  parameterized  with  basis  functions  used  the  basis  set  shown  in  Figure  5.8. 
These  basis  functions  were  generated  using  the  method  described  in  Section  3.2,  based  on 
the  work  in  Hossein-Zadeh  et  al.  [2003].  To  generate  this  basis,  we  first  created  a  matrix 
Q  containing  10,000  realizations  over  24  time  points  of  the  gamma  function: 


'l(i)  =  exp(-^)^ 

where  the  parameter  a  ranged  from  0.05  to  0.21  and  the  parameter  b  ranged  from  3  to  7. 
The  basis  set  is  the  first  three  principal  components  of  Q'Q. 
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Figure  5.8:  Basis  functions  used  to  parameterize  process  response  signatures  in  sentence- 
picture  experiments. 
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Data  Log-Likelihood 


To  compute  the  log-likelihood  of  a  single  trial  of  data  with  respect  to  a  given  HPM,  we 
compute  the  log-likelihood  of  the  trial  under  each  test  set  configuration  and  weight  that 
log-likelihood  by  the  posterior  probability  of  the  configuration.  Since  in  practice,  the  pos¬ 
terior  probability  distributions  tend  to  be  sharply  peaked,  almost  always  assigning  proba¬ 
bility  1  to  a  single  configuration,  this  can  be  thought  of  as  the  log-likelihood  of  the  maxi¬ 
mum  a  posteriori  (MAP)  configuration  for  each  trial.  For  each  configuration,  we  compute 
a  predicted  trial  by  adding  the  response  signature  of  each  process  instance  in  the  configu¬ 
ration  to  the  window  of  the  trial  in  which  it  is  active.  The  average  trial  (from  the  training 
set)  is  used  to  fill  in  any  time  points  in  the  predicted  trial  for  which  no  process  instance  is 
active.  This  predicted  trial  is  the  mean  used  for  the  log-likelihood  computation. 

Since  log-likelihood  has  no  concrete  units,  we  also  computed  a  baseline  log-likelihood 
from  a  naive  method,  and  we  compare  performance  across  models  and  participants  using 
improvement  over  this  baseline.  The  naive  method  used  to  create  the  baseline  is  to  simply 
average  the  time  courses  of  the  training  trials  without  regard  to  processes,  and  use  that 
time  course  as  the  mean  for  the  test  trials.  Note  that  this  baseline  averages  together  trials 
with  all  possible  orders  of  the  stimuli  (picture  followed  by  sentence  and  sentence  followed 
by  picture). 

Table  5.6  reports  the  average  improvement  over  baseline  test  set  log-likelihood  over  5 
folds  of  cross-validation  for  each  of  13  participants  under  seven  different  HPMs.  Table  5.7 
reports  the  same  metric  for  the  top  1000  most  active  voxels  for  each  participant  under  the 
restricted  set  of  four  HPMs.  Tables  5.8  and  5.9  give  the  corresponding  results  under  the 
same  experimental  conditions  as  Table  5.7,  but  for  regularized  and  basis  function  HPMs, 
respectively. 

The  first  result  of  note  is  that  for  the  standard  HPMs  (Tables  5.6  and  5.7)  and  for 
the  vast  majority  of  participant-HPM  combinations,  the  improvement  in  log-likelihood 
over  the  baseline  is  a  negative  number.  That  is,  the  data  log-likelihood  is  higher  using 
the  baseline  method  of  predicting  the  mean  training  trial  than  when  using  HPMs.  These 
results  indicate  that  standard  HPMs  are  overfitting  the  training  data.  The  presence  of 
overfitting  is  unsurprising  given  the  ratio  of  parameters  to  training  data.  (Recall  that  each 
process  in  an  HPM  has  DV  parameters  to  characterize  its  response  signature  (e.g.  DV  = 
24  x  1000),  and  only  N  trials  to  use  as  examples  (in  5-fold  cross-validation  over  40  trials, 
N  =  32).  This  is  in  addition  to  the  parameters  of  the  processes’  timing  distributions  and 
the  V  process-independent  noise  parameters.)  The  scores  get  worse  as  the  HPMs  become 
more  complex,  providing  more  evidence  for  overfitting. 

Table  5.8  suggest  that  regularization  is  able  to  mitigate  overfitting.  All  of  the  improve- 
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ments  in  data  log-likelihood  over  the  baseline  are  positive,  and  the  improvements  now 
increase  with  the  complexity  of  the  HPM.  As  a  general  trend,  it  appears  that  HPM-3-K 
and  HPM-4-K  outperform  HPM-GNB  and  HPM-2-K,  but  the  differences  between  HPM- 
3-K  and  HPM-4-K  are  small.  The  trends  for  HPMs  with  basis  functions  (Table  5.9)  are 
very  similar  to  those  for  regularized  HPMs.  On  average,  the  regularized  HPMs  just  barely 
outperform  HPMs  with  basis  functions.  Both  regularization  and  basis  functions  provide 
HPMs  with  a  temporal  smoothness  constraint,  which  appears  to  help  them  both  outper¬ 
form  standard  HPMs.  Regularization  also  provides  a  spatial  smoothness  constraint,  which 
is  not  present  in  the  basis  function  version  of  HPMs.  In  contrast,  basis  functions  provide 
a  significant  decrease  in  the  number  of  parameters  in  the  model.  The  fact  that  regularized 
HPMs  have  slightly  higher  performance  than  basis  function  HPMs  may  suggest  that  the 
spatial  smoothness  constraint  is  more  important  than  the  parameter  reduction.  Another 
possible  explanation  for  the  lower  performance  of  HPMs  with  basis  functions  is  that  the 
basis  set  used  here  might  not  be  expressive  enough  to  capture  the  correct  process  response 
signatures. 
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Participant 

GNB 

2-K 

2-U 

HPM- 

3-K 

3-U 

4-K 

4-U 

A 

0.162±0.29 

-0.487±0.059 

-0.178±0.13 

-1.24±0.21 

-1.03±0.26 

-2.01±0.16 

-1.75±0.20 

B 

0.217±0.22 

-0.196±0.12 

-0.102±0.066 

-1.12±0.23 

-1.07±0.21 

-2.17±0.15 

-2.03±0.18 

C 

-0.0940±0.33 

-0.475±0.17 

-0.360±0.14 

-1.32±0.27 

-1.18±0.26 

-2.74±0.24 

-2.47±0.35 

D 

-0.0677±0.13 

-0.326±0.096 

-0.264±0.16 

-1.23±0.13 

-1.1 1±0.12 

-2.60±0.37 

-2.47±0.34 

E 

0.0326±0.16 

-0.443±0.13 

-0.396±0.18 

-1.62±0.11 

-1.50±0.14 

-2.70±0.37 

-2.48±0.26 

F 

0.325±0.35 

-0.183±0.30 

0.0186±0.38 

-1.34±0.17 

-1.02±0.13 

-2.66±0.84 

-2.44±0.86 

G 

0.0652±0.055 

-0.31 1±0. 071 

-0.244±0.081 

-1.31±0.089 

-1.14±0.15 

-2.00±0.27 

-1.89±0.20 

H 

0.117±0.096 

-0.328±0.093 

-0.264±0.12 

-1.28±0.13 

-1.27 ±0.063 

-2.24±0.32 

-2.01±0.32 

I 

0.00216±0.21 

-0.322±0.23 

-0.261±0.22 

-1.14±0.19 

-1.03±0.15 

-2.14±0.30 

-2.02±0.30 

J 

-0.171±0.16 

-6.78±0.16 

-4.26±0.099 

-1.29±0.19 

-1.06±0.18 

-2.50±0.31 

-2.28±0.37 

K 

0.0357±0.24 

-0.273±0.16 

-0.169±0.23 

-1.15±0.24 

-1.01±0.20 

-2.55±0.43 

-2.36±0.56 

L 

0.228±0.20 

-3.95±0.058 

-3.20±0.090 

-1.73±0.064 

-1.64±0.049 

-3.41±0.68 

-3.26±0.60 

M 

0.0445±0.11 

-0.375±0.19 

-0.330±0.22 

-1.20±0.12 

-1.15±0.13 

-2.42±0.11 

-2.22±0.11 

Mean 

0.0690 

-1.11 

-0.770 

-1.31 

-1.17 

-2.47 

-2.28 

Table  5.6:  Improvement  over  baseline  of  log-likelihood,  ±  one  standard  deviation,  of  the  8-trial  test  set  under  the 
HPM  trained  on  the  remaining  32  trials  averaged  over  5-fold  cross-validation  for  13  participants.  All  values  are  x  104. 
The  baseline  is  the  log-likelihood  of  the  data  obtained  using  the  mean  of  all  the  training  trials  to  predict  each  test 
trial,  ignoring  stimulus  order  and  participant  response.  Note  that  negative  values  mean  that  the  baseline  outperformed 
the  model.  HPM-GNB  is  a  Gaussian  Naive  Bayes  with  the  same  noise  model  and  assumptions  as  HPMs.  HPM-N- 
K  is  an  HPM  with  N  processes  with  known  timing  for  ReadSentence  and  ViewPicture  (both  processes  are  forced 
to  begin  when  the  corresponding  stimulus  is  presented).  HPM-N-U  is  an  HPM  with  N  processes  with  unknown 
timing  for  ReadSentence  and  ViewPicture  (the  processes  may  begin  0  or  1  images  after  the  corresponding  stimulus 
is  presented). 


Participant 

GNB 

HPN 

2-K 

4- 

3-K 

4-K 

A 

-0.0533±0.11 

-0.224±0.038 

-0.163±0.077 

-0.327±0.063 

B 

0.0536±0.12 

0.0251±0.054 

-0.144±0.077 

-0.392±0.058 

C 

-0.116±0.15 

-0.802±0.067 

-0.236±0.065 

-0.488±0.097 

D 

-0.0924±0.059 

0.0192±0.069 

-0.0952±0.061 

-0.375±0.14 

E 

-0.0710±0.053 

-0.117±0.045 

-0.275±0.021 

-0.492±0.062 

F 

0.0924±0.15 

0.0155±0.15 

-0.196±0.068 

-0.445±0.18 

G 

-0.00969±0.025 

-0.0580±0.036 

-0.260±0.032 

-0.404±0.099 

H 

0.00556±0.043 

-0.0358±0.045 

-0.201±0.051 

-0.466±0.079 

I 

-0.0133±0.071 

-0.053 1±0.069 

-0.229±0.048 

-0.511±0.049 

J 

-0.126±0.095 

-0.165±0.091 

-0.207±0.019 

-0.477±0.071 

K 

-0.0654±0.12 

0.0304±0.11 

-0.106±0.13 

-0.402±0.13 

L 

0.0267±0.056 

-0.0817±0.023 

-0.279±0.031 

-0.574±0.098 

M 

-0.0124±0.074 

-0.0506±0.099 

-0.210±0.089 

-0.482±0.071 

Mean 

-0.0293 

-0.115 

-0.200 

-0.449 

Table  5.7:  For  the  baseline  HPM  with  the  top  1000  most  active  voxels,  improvement 
over  baseline  of  log-likelihood,  ±  one  standard  deviation,  of  the  8-trial  test  set  under 
the  HPM  trained  on  the  remaining  32  trials  averaged  over  5-fold  cross-validation  for  13 
participants.  All  values  are  xlO4.  The  baseline  is  the  log-likelihood  of  the  data  obtained 
using  the  mean  of  all  the  training  trials  to  predict  each  test  trial,  ignoring  stimulus  order  and 
participant  response.  Note  that  negative  values  mean  that  the  baseline  outperformed  the 
model.  HPM-GNB  is  a  Gaussian  Naive  Bayes  with  the  same  noise  model  and  assumptions 
as  HPMs.  HPM-N-K  is  an  HPM  with  N  processes  with  known  timing  for  ReadSentence 
and  ViewPicture  (both  processes  are  forced  to  begin  when  the  corresponding  stimulus  is 
presented). 
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Participant 

GNB 

HP 

2-K 

M- 

3-K 

4-K 

A 

0.194±0.15 

0.170±0.078 

0.525±0.12 

0.525±0.16 

B 

0.357±0.12 

0.492±0.073 

0.564±0.092 

0.558±0.096 

C 

0.181±0.17 

0.388±0.096 

0.478±0.086 

0.483±0.10 

D 

0.193±0.070 

0.459±0.082 

0.600±0.041 

0.610±0.065 

E 

0.211±0.080 

0.327±0.079 

0.413±0.064 

0.371±0.049 

F 

0.344±0.16 

0.417±0.18 

0.442±0.095 

0.369±0.11 

G 

0.265±0.035 

0.376±0.049 

0.408±0.55 

0.397±0.070 

H 

0.309±0.042 

0.4413=0.055 

0.516±0.067 

0.512±0.067 

I 

0.289±0.086 

0.413±0.095 

0.473±0.090 

0.462±0.082 

J 

0.185±0.082 

0.314±0.087 

0.506±0.071 

0.520±0.084 

K 

0.231±0.13 

0.489±0.14 

0.587±0.17 

0.579±0.17 

L 

0.315±0.078 

0.373±0.038 

0.427 ±0.051 

0.362±0.052 

M 

0.289±0.067 

0.424±0.092 

0.503±0.089 

0.507±0.097 

Mean 

0.259 

0.391 

0.496 

0.481 

Table  5.8:  For  HPMs  with  spatial  smoothness  and  temporal  smoothness  regularization 
penalties  on  the  top  1000  most  active  voxels,  improvement  over  baseline  of  log-likelihood, 
±  one  standard  deviation,  of  the  8-trial  test  set  under  the  HPM  trained  on  the  remaining 
32  trials  averaged  over  5-fold  cross-validation  for  13  participants.  All  values  are  xlO4. 
The  baseline  is  the  log-likelihood  of  the  data  obtained  using  the  mean  of  all  the  training 
trials  to  predict  each  test  trial,  ignoring  stimulus  order  and  participant  response.  Note  that 
negative  values  mean  that  the  baseline  outperformed  the  model.  HPM-GNB  is  a  Gaussian 
Naive  Bayes  with  the  same  noise  model  and  assumptions  as  HPMs.  HPM-N-K  is  an  HPM 
with  N  processes  with  known  timing  for  ReadSentence  and  ViewPicture  (both  processes 
are  forced  to  begin  when  the  corresponding  stimulus  is  presented). 
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Participant 

GNB 

HPN 

2-K 

4- 

3-K 

4-K 

A 

0.0465±0.091 

0.0371  ±0.093 

0.343±0.29 

0.383±0.22 

B 

0.318±0.075 

0.486±0.067 

0.531±0.090 

0.543±0.062 

C 

0.0653±0.083 

0.352±0.066 

0.416±0.051 

0.472±0.014 

D 

0.0693±0.15 

0.371±0.13 

0.477±0.12 

0.514±0.10 

E 

0.0535±0.076 

0.254±0.082 

0.325±0.073 

0.315±0.047 

F 

0.395±0.31 

0.476±0.33 

0.560±0.31 

0.479±0.23 

G 

0.224±0.089 

0.384±0.092 

0.405±0.097 

0.415±0.093 

H 

0.328±0.056 

0.471±0.078 

0.596±0.33 

0.608±0.060 

I 

0.275±0.10 

0.413±0.10 

0.486±0.10 

0.471  ±0.094 

J 

0.1483=0.085 

0.327±0.088 

0.482±0.088 

0.511±0.063 

K 

0.184±0.19 

0.469±0.15 

0.530±0.14 

0.533±0.20 

L 

0.2793=0.069 

0.401±0.072 

0.491  ±0.090 

0.452±0.064 

M 

0.223±0.10 

0.417±0.12 

0.483±0.11 

0.504±0.11 

Mean 

0.201 

0.374 

0.471 

0.477 

Table  5.9:  For  HPMs  parameterized  with  basis  functions  using  the  top  1000  most  active 
voxels,  improvement  over  baseline  of  log-likelihood,  ±  one  standard  deviation,  of  the  8- 
trial  test  set  under  the  HPM  trained  on  the  remaining  32  trials  averaged  over  5-fold  cross- 
validation  for  13  participants.  All  values  are  xlO4.  The  baseline  is  the  log-likelihood  of 
the  data  obtained  using  the  mean  of  all  the  training  trials  to  predict  each  test  trial,  ignoring 
stimulus  order  and  participant  response.  Note  that  negative  values  mean  that  the  baseline 
outperformed  the  model.  HPM-GNB  is  a  Gaussian  Naive  Bayes  with  the  same  noise 
model  and  assumptions  as  HPMs.  HPM-N-K  is  an  HPM  with  N  processes  with  known 
timing  for  ReadSentence  and  ViewPicture  (both  processes  are  forced  to  begin  when  the 
corresponding  stimulus  is  presented). 
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Classification  Accuracy 


As  an  alternative  to  cross-validated  data  log-likelihood,  we  consider  here  using  classifica¬ 
tion  accuracy  to  compare  HPMs.  First,  we  briefly  describe  how  the  classification  accuracy 
for  a  single  fold  of  the  cross-validation  is  computed  for  the  HPM  models.  On  a  given 
fold,  each  HPM  contains  a  set  of  configurations  for  each  of  the  32  training  trials  and  a 
set  of  configurations  for  each  of  the  8  test  trials.  We  estimate  the  parameters  of  the  HPM 
using  the  training  set  (using  the  learning  algorithms  in  Chapter  2)  and  use  them  to  com¬ 
pute  the  posterior  probability  of  each  test  set  configuration  for  each  test  trial  (using  the 
inference  algorithm  in  Chapter  2).  Given  the  posterior  probabilities  of  the  configurations 
for  a  particular  test  trial,  we  can  compute  the  marginal  probability  of  any  property  of  the 
configurations.  For  example,  we  can  sum  the  posterior  probabilites  of  all  the  configura¬ 
tions  whose  first  process  instance  is  ViewPicture  to  get  the  marginal  probability  that  the 
first  process  instance  of  this  test  trial  is  ViewPicture.  In  fact,  we  compute  these  marginal 
probabilities  for  all  combinations  of  process  instance  and  process  ID  independently.  We 
can  then  make  a  prediction  by  choosing  the  process  ID  with  the  highest  marginal  proba¬ 
bility  for  each  process  ID.  For  each  trial,  this  prediction  is  compared  with  the  true  process 
IDs.  The  fraction  correct  on  each  trial  is  averaged  over  trials  to  obtain  the  classification 
accuracy  over  the  whole  test  set.  This  is  repeated  for  each  fold  of  the  cross-validation,  and 
those  results  are  once  again  averaged  to  produce  a  single  classification  accuracy.  The  stan¬ 
dard  deviations  reported  in  Table  5.10  are  computed  across  folds  of  the  cross-validation 
using  this  mean. 

Table  5.10  presents  5-fold  cross-validated  accuracies  for  the  task  of  classifying  Read- 
Sentence  and  ViewPicture  processes  for  eight  different  models  in  all  13  participants,  using 
standard  HPMs.  Table  5.11  reports  the  same  accuracies,  but  restricted  to  the  top  1000  most 
active  voxels  for  each  participant,  and  the  smaller  set  of  models.  Tables  5.12  and  5.13  give 
the  corresponding  results  for  HPMs  with  regularization  and  basis  functions,  respectively. 

The  first  general  result  of  note  involving  classification  accuracies  is  that  on  average, 
HPM-2-K  performs  slightly  better  than  HPM-GNB,  but  adding  the  third  and  fourth  pro¬ 
cesses  to  the  model  tends  to  degrade  performance.  That  HPM-2-K  improves  upon  HPM- 
GNB  suggests  that  modeling  the  temporal  overlap  between  the  first  two  processes’  re¬ 
sponse  signatures  is  helpful.  The  decreased  performance  for  the  more  complex  HPMs 
may  or  may  not  be  surprising.  On  the  one  hand,  we  might  expect  that  adding  the  Decide 
process  would  improve  classification  accuracies  among  the  first  two  processes  by  decon¬ 
volving  the  beginning  of  the  third  processes  from  the  end  of  the  second  process.  How¬ 
ever,  this  assumes  that  the  Decide  processes  begin  soon  enough  after  the  second  stimuli  to 
have  an  impact  on  the  estimation  of  ViewPicture  and  ReadSentence,  and  that  the  response 
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signature  of  the  third  process  has  significant  spatial  overlap  with  the  first  two.  If  these 
assumptions  are  not  met,  the  addition  of  a  third  process  could  be  detrimental  because  of 
the  additional  parameters  that  the  learning  algorithm  to  estimate,  taking  focus  away  from 
the  two  processes  involved  in  the  classification  task.  In  our  experiments,  it  appears  that 
the  potential  benefits  of  deconvolving  contributions  of  the  third  process  from  the  first  two 
processes  is  outweighed  by  the  additional  parameters  required  to  add  the  Decide  process. 

Another  general  trend  in  the  results  reporting  classification  accuracies  is  that  none  of 
the  HPMs  beat  the  classification  scores  of  a  Gaussian  Naive  Bayes  classifier.  In  consider¬ 
ing  this  result,  it  is  important  to  recall  that  GNB  has  a  different  noise  model  than  HPMs. 
GNB  has  a  feature  for  each  voxel  at  each  image  (time  point)  of  the  8  seconds  following  the 
stimulus.  For  each  feature,  GNB  estimates  a  mean  and  variance  parameter  for  each  class. 
In  contrast,  HPMs  have  process-independent  (analagous  to  class-independent)  noise  pa¬ 
rameters  for  each  voxel.  GNB  and  HPM-GNB  provide  a  direct  comparison  of  the  two 
noise  models,  showing  that  the  class-specific  noise  model  in  GNB  is  advantageous.  While 
HPMs  do  not  classify  the  stimuli  as  well  as  GNB,  they  do  tend  to  perform  better  than 
random  in  most  cases.  (Since  this  is  a  binary  classification  task,  the  null  hypothesis  is  that 
for  each  fold,  the  number  of  correct  classifications  achieved  by  a  classifier  with  no  dis¬ 
criminative  power  will  be  Binomial  with  parameters  p  =  0.5  and  Ar  =  8.  At  significance 
level  a  =  0.05,  we  would  reject  this  null  hypothesis  for  X  >  6  correct  classifications, 
or  accuracy  of  0.75.)  This  suggests  that  even  though  HPMs  are  not  optimized  to  find  the 
decision  boundary  between  ViewPicture  and  ReadSentence,  they  do  learn  information  that 
discriminates  between  these  two  processes  to  some  extent. 

Finally,  we  note  that  on  average,  HPMs  parameterized  with  basis  functions  slightly 
outperform  regularized  HPMs,  which  outperform  the  standard  HPMs  in  terms  of  classi¬ 
fication  accuracy  on  this  task.  This  is  in  contrast  to  the  comparison  in  terms  of  held-out 
data  log-likelihood,  where  regularized  HPMs  outperformed  HPMs  with  basis  functions 
on  average.  The  different  metrics  lend  themselves  to  slightly  different  interpretations. 
The  log-likelihood  metric  speaks  to  the  ability  of  the  HPM  in  question  to  track  the  time 
series,  whereas  the  classification  metric  addresses  the  ability  of  the  HPM  to  model  the 
decision  boundary  of  the  classification  task.  The  metrics  are  related  through  the  poste¬ 
rior  distribution  on  configurations.  The  log-likelihood  metric  is  a  weighted  average  of 
the  log-likelihood  under  each  configuration  weighted  by  its  posterior  probability,  which 
in  practice  generally  corresponded  to  the  log-likelihood  under  the  maximum  a  posteriori 
configuration.  The  classification  metric  computes  the  marginal  probabilities  for  the  clas¬ 
sification  task  from  the  posterior  distribution  over  configurations,  which  also  generally 
corresponded  to  the  classification  from  the  MAP  configuration.  HPMs  are  optimized  in 
training  for  log-likelihood,  so  we  suggest  using  the  log-likelihood  metric  as  the  primary 
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evaluation  criterion,  and  using  the  classification  accuracy  secondarily. 
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Participant 

GNB 

GNB 

2-K 

2-U 

HPM- 

3-K 

3-U 

4-K 

4-U 

A 

80.0±16.2 

62.5±17.7 

67.53=11.2 

62.53=17.7 

65.0±20.5 

72.5±20.5 

75.03=14.4 

64.23=15.8 

B 

92.5±8.2 

92.5±11.2 

97.53=5.6 

97.53=5.6 

95.0±6.9 

95.03=6.9 

93.3±5.6 

90.8±11.2 

C 

95.0±8.1 

82.53=14.3 

82.5±14.2 

75.03=17.7 

72.53=16.3 

77.53=10.5 

74.23=9.0 

69.23=14.0 

D 

100.0±0.0 

92.53=6.8 

97.5±5.6 

100.0±0.0 

87.53=8.8 

87.5±8.8 

74.23=12.3 

80.03=7.5 

E 

95.0±5.2 

87.5±8.8 

77.53=10.5 

70.0±20.9 

80.03:16.8 

75.03=12.5 

79.23=19.3 

81.7±21.4 

F 

73.8±19.5 

80.0±11.2 

75.03=8.8 

80.03=11.2 

82.53=14.3 

72.53=10.5 

80.03=8.5 

72.53=10.5 

G 

93.8±4.4 

70.0±19.0 

62.5±17.7 

70.0±6.8 

67.5±16.8 

60.03=10.5 

84.2±10.4 

80.8±7.6 

H 

87.5±9.9 

90.03=10.5 

82.5±6.8 

90.0±10.5 

82.53=14.3 

85.0±10.5 

82.5±17.0 

79.2±6.6 

I 

88.8±12.0 

82.5±6.8 

82.53=1 1.2 

80.03=16.8 

75.0±15.3 

75.0±17.7 

80.8±12.7 

76.7±12.0 

J 

93.8±4.4 

80.0±11.2 

70.0±16.8 

75.0±19.8 

82.53=19.0 

85.0±16.3 

78.3±16.0 

73.33=16.3 

K 

96.3±8.4 

75.03=19.8 

90.03=16.3 

87.53=21.7 

87.53=15.3 

85.0±16.3 

81.7±14.0 

81.7±14.0 

L 

71.3±23.6 

57.5±16.8 

62.5±15.3 

65.0±20.5 

67.5±20.9 

52.5±18.5 

52.5±19.5 

57.53=14.8 

M 

97.5±3.4 

80.0±14.3 

82.53=14.3 

82.53=14.3 

67.5±16.8 

70.03=19.0 

83.33=13.5 

80.0±14.3 

Mean 

89.6 

79.4 

72.9 

79.6 

77.9 

76.3 

78.4 

76.0 

Table  5.10:  Percent  accuracy  ±  one  standard  deviation  at  classifying  the  first  two  process  instances  per  trial  as  ei¬ 
ther  ReadSentence  or  ViewPicture  averaged  over  5-fold  cross-validation  for  13  participants.  GNB  is  the  traditional 
Gaussian  Naive  Bayes  classifier.  HPM-GNB  is  a  Gaussian  Naive  Bayes  with  the  same  noise  model  and  assumptions 
as  HPMs.  HPM-N-K  is  an  HPM  with  N  processes  with  known  timing  for  ReadSentence  and  ViewPicture  (both 
processes  are  forced  to  begin  when  the  corresponding  stimulus  is  presented).  HPM-N-U  is  an  HPM  with  N  pro¬ 
cesses  with  unknown  timing  for  ReadSentence  and  ViewPicture  (the  processes  may  begin  0  or  1  images  after  the 
corresponding  stimulus  is  presented). 


Participant 

GNB 

GNB 

HP 

2-K 

M- 

3-K 

4-K 

A 

86.3±14.3 

67.5±6.8 

67.5±16.8 

67.53=16.8 

67.53=16.8 

B 

92.5±10.3 

97.5±5.6 

97.5±5.6 

97.53=5.6 

90.8±11.2 

C 

97.5±5.6 

85.0±16.3 

90.0±16.3 

90.0±16.3 

85.03=16.3 

D 

97.5±3.4 

95.0±11.2 

97.5±5.6 

97.53=5.6 

85.8±7.0 

E 

100.0±0.0 

97.5±5.6 

92.5±11.2 

85.03=16.3 

89.23=12.0 

F 

81.3±21.7 

80.0±6.8 

85.03=10.5 

85.03=5.6 

77.53=12.7 

G 

97.5±5.6 

77.5±10.5 

72.5±13.7 

80.0±14.3 

81.7±8.1 

H 

96.3±5.6 

92.5±11.2 

95.0±6.8 

90.0±6.8 

88.3±4.6 

I 

92.5±6.8 

87.5±8.8 

85.03=16.3 

85.0±13.7 

88.33=12.3 

J 

96.3±3.4 

92.5±11.2 

95.03=11.2 

95.0±6.8 

88.3±11.6 

K 

93.8±8.8 

82.5±16.8 

92.5±16.8 

92.53=16.8 

86.7±16.2 

L 

78.8±14.4 

67.5±14.3 

70.0±14.3 

77.5±16.3 

69.2±18.5 

M 

100.0±0.0 

92.5±6.8 

92.5±1 1.2 

85.03=16.3 

91.7±5.9 

Mean 

93.1 

85.8 

87.1 

86.7 

83.8 

Table  5.11:  For  the  baseline  HPM  with  the  top  1000  most  active  voxels,  percent  accuracy 
±  one  standard  deviation  at  classifying  the  first  two  process  instances  per  trial  as  either 
ReadSentence  or  ViewPicture  averaged  over  5-fold  cross-validation  for  13  participants. 
GNB  is  the  traditional  Gaussian  Naive  Bayes  classifier.  HPM-GNB  is  a  Gaussian  Naive 
Bayes  with  the  same  noise  model  and  assumptions  as  HPMs.  HPM-N-K  is  an  HPM  with 
N  processes  with  known  timing  for  ReadSentence  and  ViewPicture  (both  processes  are 
forced  to  begin  when  the  corresponding  stimulus  is  presented). 
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Participant 

GNB 

GNB 

HP 

2-K 

M- 

3-K 

4-K 

A 

86.3±14.3 

75.0  ±8.8 

70.0±19.0 

75.0±17.7 

70.0±18.3 

B 

92.5±10.3 

97.5±5.6 

97.5±5.6 

95.0±11.2 

90.8=b  11.2 

C 

97.5±5.6 

85.0±16.3 

95 .0=b  11.2 

92.5±16.8 

90.0±13.7 

D 

97.5±3.4 

95.0±1 1.2 

100.0±0.0 

97.5±5.6 

85.8±7.0 

E 

100.0±0.0 

100.0±0.0 

92.5±11.2 

95.0±6.8 

89.2±8.1 

F 

81.3±21.7 

80.0±6.8 

85.0±10.5 

80.0±14.3 

76.7±15.2 

G 

97.5±5.6 

85.0±10.5 

77.5±18.5 

77.5±16.3 

78.3±6.8 

H 

96.3±5.6 

95.0±6.8 

97.5±5.6 

90.0±5.6 

88.3±4.6 

I 

92.5±6.8 

77.5±10.5 

90.0±10.5 

85.0±10.5 

85.8±7.6 

J 

96.3±3.4 

92.5±6.8 

97.5±5.6 

95.0±6.8 

93.3±6.3 

K 

93.8±8.8 

80.0±19.0 

92.5±16.8 

92.5±16.8 

85.0±19.9 

L 

78.8±14.4 

67.5±14.3 

77.5±16.3 

80.0±14.3 

70.8±16.4 

M 

100.0±0.0 

95.0±6.8 

97.5±5.6 

97.5±5.6 

93.3±5.6 

Mean 

93.1 

86.5 

90.0 

88.7 

84.4 

Table  5.12:  For  HPMs  with  spatial  and  temporal  smoothness  regularization  penalties  on 
the  top  1000  most  active  voxels,  percent  accuracy  ±  one  standard  deviation  at  classifying 
the  first  two  process  instances  per  trial  as  either  ReadSentence  or  ViewPicture  averaged 
over  5-fold  cross-validation  for  13  participants.  GNB  is  the  traditional  Gaussian  Naive 
Bayes  classifier.  HPM-GNB  is  a  Gaussian  Naive  Bayes  with  the  same  noise  model  and 
assumptions  as  HPMs.  HPM-N-K  is  an  HPM  with  N  processes  with  known  timing  for 
ReadSentence  and  ViewPicture  (both  processes  are  forced  to  begin  when  the  correspond¬ 
ing  stimulus  is  presented). 
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Participant 

GNB 

GNB 

HP 

2-K 

M- 

3-K 

4-K 

A 

86.3±14.3 

72.5±13.7 

72.5±20.5 

72.53=13.7 

82.53=9.9 

B 

92.5±10.3 

97.5±5.6 

97.5±5.6 

92.53=11.2 

88.33=11.2 

C 

97.5±5.6 

90.0±13.7 

87.5±12.5 

92.53=11.2 

92.5±7.5 

D 

97.5±3.4 

100.0±0.0 

100.0±0.0 

97.5±5.6 

85.8±7.0 

E 

100.0±0.0 

100.0±0.0 

95.03=11.2 

100.0±0.0 

94.2±6.3 

F 

81.3±21.7 

80.0±11.2 

90.0±5.6 

85.0±5.6 

79.2±8.3 

G 

97.5±5.6 

82.5±14.3 

80.03=11.2 

82.53=14.3 

79.2±7.8 

H 

96.3±5.6 

97.5±5.6 

97.5±5.6 

92.5±6.8 

93.3±3.7 

I 

92.5±6.8 

85.0±13.7 

87.53=8.8 

85.03=10.5 

80.83=10.5 

J 

96.3±3.4 

95.0±6.8 

97.53=5.6 

97.53=5.6 

90.83=5.4 

K 

93.8±8.8 

92.5±11.2 

90.03=16.3 

90.03=16.3 

86.7±11.9 

L 

78.8±14.4 

82.5±19.0 

87.5±8.8 

87.53=8.8 

69.2±23.9 

M 

100.0±0.0 

100.0±0.0 

95.03=11.2 

100.0±0.0 

97.53=2.3 

Mean 

93.1 

90.4 

90.6 

90.4 

86.2 

Table  5.13:  For  HPMs  parameterized  with  basis  functions  using  the  top  1000  most  ac¬ 
tive  voxels,  percent  accuracy  ±  one  standard  deviation  at  classifying  the  first  two  pro¬ 
cess  instances  per  trial  as  either  ReadSentence  or  ViewPicture  averaged  over  5-fold  cross- 
validation  for  13  participants.  GNB  is  the  traditional  Gaussian  Naive  Bayes  classifier. 
HPM-GNB  is  a  Gaussian  Naive  Bayes  with  the  same  noise  model  and  assumptions  as 
HPMs.  HPM-N-K  is  an  HPM  with  N  processes  with  known  timing  for  ReadSentence 
and  ViewPicture  (both  processes  are  forced  to  begin  when  the  corresponding  stimulus  is 
presented). 
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5.2.3  Visualizing  Learned  HPMs 


The  above  section  used  5-fold  cross-validation  to  estimate  the  performance  of  a  variety 
of  models  in  terms  of  classification  accuracy  and  data  log-likelihood.  Of  course,  the  best 
model  we  can  learn  uses  all  available  data.  For  this  section,  we  trained  a  variety  of  models 
using  all  available  trials  and  voxels. 

We  can  visualize  the  resulting  HPMs  in  a  number  of  ways.  To  see  the  spatial  response 
for  a  process,  we  can  average  the  response  for  each  voxel  over  time  and  plot  the  result 
over  the  whole  brain.  Examples  of  this  type  of  visualization  are  given  in  Figures  5.9  and 
5.10.  These  figures  were  extracted  from  HPM  ’browsers’  written  in  Matlab.  Appendix 
F  contains  images  of  the  processes  of  HPM-3-K  learned  using  standard,  regularized,  and 
basis  function  versions  of  HPMs  on  Participant  D  (selected  for  generally  high  performance 
across  metrics  and  algorithms).  At  this  level  of  aggregation,  the  differences  between  the 
three  versions  of  HPMs  are  almost  impossible  to  see. 

Another  way  to  visualize  the  learned  HPMs  is  to  plot  the  temporal  response  for  a  pro¬ 
cess  in  any  given  voxel.  Figures  5.11,  5.12,  and  5.13  show  the  time  course  of  a  single 
voxel  in  the  right  dorso-lateral  prefrontal  cortex  for  standard,  regularized,  and  basis  func¬ 
tion  versions  of  HPMs  respectively,  again  for  Participant  D.  Unlike  the  images  that  are 
averaged  over  time,  these  plots  plainly  show  differences  among  the  methods,  with  the 
temporal  smoothness  of  the  time  course  increasing  from  standard  to  regularized  to  basis 
function  HPMs. 

We  can  also  compare  the  learned  timing  distributions  for  the  Decide  process  across 
methods.  These  results  (again  for  Participant  D)  are  shown  in  Tables  5.14,  5.15,  and 
5.16,  for  standard,  regularized,  and  basis  function  versions  of  HPMs  respectively.  The 
standard  and  regularized  HPMs  have  very  similar  timing  distributions  for  Decide,  whereas 
the  distribution  for  HPMs  with  basis  functions  is  skewed  more  toward  the  beginning  of  the 
interval. 

These  visualizations  are  important  for  interpretating  the  learned  HPMs,  and  being  able 
to  visualize  them  at  all  for  a  process  with  an  onset  that  is  both  unknown  and  variable 
from  one  trial  to  the  next  is  a  major  contribution  of  HPMs.  Recall  that  the  true  meaning 
of  this  process  is  unsupervised;  it  is  the  process  learned  from  the  training  set  that  begins 
between  the  second  stimulus  and  the  button  press.  To  fully  understand  the  process,  we 
must  examine  the  process  response  signature  and  timing  distribution  in  the  context  of  the 
study. 


86 
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Figure  5.9:  Screen  shot  of  an  HPM  viewer  written  in  Matlab.  Drop-down  menus  select 
from  the  following  views:  observed  data  (for  any  particular  trial),  observed  data  averaged 
over  all  trials,  predicted  data  (for  any  particular  trial),  and  process  response  signatures  (for 
any  particular  process).  Anatomical  regions  of  interest  (ROIs)  can  also  be  highlighted,  as 
demonstrated  here  for  the  visual  cortex. 
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subject  04847,  avgObserved  Trial,  Z=8  amplitude=[-29.4  40.1] 


i* 

z- 


Process: 

comprehend ...  v 

Timing  Distribution: 

Offset  Probability 
01.0 


Trial 


Display  ROI 


CALC  v 

Display  image 
avgObserved  v 


Figure  5.10:  Screen  shot  of  an  HPM  viewer  written  in  Matlab.  The  viewer  displays  a  single 
z-slice  at  a  time,  and  clicking  on  a  voxel  brings  up  a  new  plot  of  the  temporal  response 
for  that  voxel  (which  was  averaged  to  produce  the  original  plot).  Drop-down  menus  select 
from  the  following  views:  observed  data  (for  any  particular  trial),  observed  data  averaged 
over  all  trials,  predicted  data  (for  any  particular  trial),  and  process  response  signatures  (for 
any  particular  process).  Anatomical  regions  of  interest  (ROIs)  can  also  be  highlighted. 
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4.5 


Figure  5.11:  The  timecourse  of  the  response  signature  for  the  Decide  process  in  HPM- 
3-K  learned  from  all  trials  and  all  voxels  using  standard  HPMs  for  one  voxel  in  the  right 
dorsolateral  prefrontal  cortex  (in  z-slice  5). 
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Time  (in  images) 

Figure  5.12:  The  timecourse  of  the  response  signature  for  the  Decide  process  in  HPM-3-K 
learned  from  all  trials  and  all  voxels  using  regularized  HPMs  for  one  voxel  in  the  right 
dorsolateral  prefrontal  cortex  (in  z- slice  5). 
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Figure  5.13:  The  timecourse  of  the  response  signature  for  the  Decide  process  in  HPM-3-K 
learned  from  all  trials  and  all  voxels  using  HPMs  with  basis  functions  for  one  voxel  in  the 
right  dorsolateral  prefrontal  cortex  (in  z- slice  5). 
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Offset: 

0 

1 

2 

3 

4 

5 

6 

7 

Theta: 

0.2957 

0.0762 

0.1006 

0.0518 

0.0518 

0.1982 

0.0762 

0.1494 

Table  5.14:  Timing  distribution  for  the  Decide  process  in  HPM-3-K  learned  from  all  trials 
and  all  voxels  using  standard  HPMs. 


Offset: 
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3 

4 

5 

6 

7 

Theta: 

0.2957 

0.0762 

0.1014 

0.0518 

0.0518 

0.1973 

0.0762 

0.1494 

Table  5.15:  Timing  distribution  for  the  Decide  process  in  HPM-3-K  learned  from  all  trials 
and  all  voxels  of  Particpant  D  using  regularized  HPMs. 


Offset: 
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5 
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Theta: 

0.525 

0.1 

0.1 

0.075 

0.05 

0.025 

0.05 

0.075 

Table  5.16:  Timing  distribution  for  the  Decide  process  in  HPM-3-K  learned  from  all  trials 
and  all  voxels  of  Particpant  D  using  HPMs  with  basis  functions. 
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This  chapter  has  provided  the  results  of  a  variety  of  experiments  on  HPMs,  using 
both  the  real  sentence-picture  experiment  fMRI  data  and  synthetic  data.  The  next  chapter 
discusses  these  results  and  summarizes  our  conclusions  for  Hidden  Process  Models. 
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Chapter  6 
Conclusions 


This  thesis  has  presented  Hidden  Process  Models,  as  inspired  by  the  fMRI  domain.  We 
detailed  the  model  formalism  and  inference  and  learning  algorithms  in  Chapter  2.  In  Chap¬ 
ter  3,  we  described  extensions  to  the  standard  version  of  HPMs,  including  regularization 
penalties  for  the  process  response  signatures,  a  reparameterization  of  the  process  response 
signatures  using  basis  functions,  and  a  version  of  HPMs  with  continuous  process  start 
times.  Chapter  4  discussed  the  theoretical  correspondence  between  HPMs  and  Dynamic 
Bayes  Nets,  and  gave  suggestions  for  how  the  models  could  be  learned  in  that  framework. 
In  Chapter  5  we  provided  experimental  results  evaluating  the  performance  of  HPMs  on 
real  and  synthetic  datasets,  and  we  gave  examples  of  how  the  learned  models  could  be 
visualized. 

In  this  chapter,  we  review  the  results  presented  in  Chapter  5.  We  then  suggest  future 
directions  for  work  on  HPMs,  provide  some  advice  to  potential  users  of  HPMs,  and  revisit 
related  work  to  illustrate  the  contributions  of  HPMs.  Finally,  we  summarize  and  conclude. 


6.1  Discussion  of  Results 

The  synthetic  data  results  in  Chapter  5  provide  a  proof-of-concept  for  the  HPM  approach. 
We  showed  that  for  data  that  is  consistent  with  the  assumptions  of  HPMs,  HPMs  can  ac¬ 
curately  recover  the  process  response  signatures.  We  also  demonstrated  that  held-out  data 
log-likelihood  can  be  used  to  select  the  HPM  that  truly  generated  the  data  from  among 
several  HPMs  with  different  numbers  of  processes.  These  results  are  important  because 
they  are  obtained  using  data  simulated  to  mimic  the  real  fMRI  experiment,  but  in  a  sit¬ 
uation  where  we  know  the  HPM  modeling  assumptions  are  correct  and  where  we  know 
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ground  truth.  Two  key  components  of  the  thesis  of  this  research  were  to  recover  the  spatial- 
temporal  signatures  of  processes  with  uncertain  and  variable  onset,  and  to  compare  several 
models  of  an  fMRI  study.  Since  in  the  real  data  we  do  not  know  the  true  process  response 
signatures  or  the  true  number  of  processes  underlying  the  data,  we  provide  the  synthetic 
data  results  to  demonstrate  success  at  both  of  these  tasks  where  we  do  have  ground  truth. 

The  results  comparing  models  and  methods  using  cross-validated  data  log-likelihood 
improvement  over  the  baseline  on  real  fMRI  data  indicated  that  HPMs  trained  using  the 
standard  parameterization  and  algorithm  tend  to  overfit,  as  evidenced  by  log-likelihoods 
that  are  lower  than  the  baseline.  However,  the  HPMs  using  regularization  or  basis  func¬ 
tions  do  not  appear  to  overfit,  consistently  outperforming  the  baseline  method.  HPMs 
with  regularization  generally  perform  similarly  to  HPMs  with  basis  functions.  The  im¬ 
provement  over  standard  HPMs  that  regularization  and  basis  functions  have  in  common 
is  temporal  smoothness  constraints  on  the  process  response  signatures,  and  our  results 
suggest  that  these  constraints  are  important. 

Chapter  5  also  provided  cross-validated  classification  accuracies  on  the  ViewPicture 
vs.  ReadSentence  task  for  the  real  fMRI  experiment.  These  accuracies  are  helpful  because 
they  have  a  concrete  interpretation,  whereas  the  scale  of  the  data  log-likelihood  scores  is 
harder  to  interpret.  However,  they  also  measure  performance  on  a  task  that  only  involves 
the  first  part  of  the  trial,  and  aggregate  spatial  and  temporal  information  (which  may  be 
interesting  in  its  own  right)  into  a  binary  decision.  We  saw  that  HPM-2-K  outperformed 
HPM-GNB  on  this  classification  task,  where  the  difference  between  those  two  models  was 
that  HPM-2-K  allowed  the  ViewPicture  and  ReadSentence  processes  to  have  temporal 
overlap.  Another  trend  in  the  classification  results  was  that  the  HPMs  containing  more 
than  just  the  two  processes  involved  in  the  classification  task  did  not  perform  as  well 
as  HPMs  with  just  ViewPicture  and  ReadSentence.  We  suggest  that  this  is  because  the 
extra  processes  being  modeled  in  the  more  complex  HPMs  do  not  have  much  spatial- 
temporal  overlap  with  the  first  two  processes.  If  there  were  more  spatial-temporal  overlap, 
modeling  the  Decide  and/or  PressButton  processes  might  aid  classification,  for  example 
by  deconvolving  the  beginning  of  the  third  process  from  the  end  of  the  second.  However  in 
the  absence  of  signification  spatial-temporal  overlap,  the  additional  parameters  required  to 
model  the  extra  processes  may  force  the  algorithms  to  make  compromises  at  the  expense 
of  the  ViewPicture  and  ReadSentence  process,  degrading  performance. 

In  general,  our  results  show  that  HPMs  do  not  outperform  a  Gaussian  Naive  Bayes 
classifier  on  this  task,  but  we  note  that  GNB  has  some  advantages  over  HPMs  for  clas¬ 
sification,  including  the  ability  to  model  class-conditional  variances  over  time  and  space, 
as  opposed  to  the  process-independent,  time-independent,  voxel-wise  variances  in  HPMs. 
We  note  that  it  would  not  be  difficult  to  extend  HPMs  to  model  a  variance  for  each  point 
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in  time  and  space  over  the  trials,  as  long  as  all  trials  are  the  same  length.  We  could  also 
imagine  an  extension  to  the  generative  model  that  includes  variances  that  depend  on  the 
processes,  but  this  would  require  design  choices  about  how  to  combine  variances  when 
process  instances  overlap,  and  how  to  model  variance  when  no  process  instances  are  ac¬ 
tive.  Process-specific  variances  may  also  introduce  additional  challenges  for  learning  the 
model.  While  HPM  accuracies  were  often  lower  than  GNB  accuracies,  they  were  usually 
well  above  naive  guessing,  indicating  that  while  HPMs  are  not  optimized  for  classification, 
they  can  still  recover  some  information  helpful  for  discriminating  among  proceses. 

In  addition  to  estimating  HPM  performance  with  cross-validation,  we  have  demon¬ 
strated  how  trained  HPMs  can  be  visualized  by  modelers,  which  can  be  helpful  for  ex¬ 
ploratory  data  analysis.  We  must  reiterate  though,  that  it  is  critical  to  consider  the  assump¬ 
tions  of  the  model  carefully  when  browsing  learned  HPMs. 


6.2  Advice  on  Using  HPMs 

When  considering  applying  HPMs  to  an  fMRI  study,  the  goal  of  the  analysis  should  be 
kept  in  mind.  If  the  task  of  interest  is  to  classify  among  a  set  of  stimuli  based  on  the  fMRI 
data  in  a  fixed-length  window  beginning  at  the  stimulus  presentation,  HPMs  may  not  be 
the  right  tool.  There  are  numerous  other  classification  techniques  that  are  optimized  more 
specifically  for  finding  decision  boundaries,  and  as  we  have  demonstrated,  HPMs  do  not 
outperform  Gaussian  Naive  Bayes  on  the  ReadSentence  vs.  ViewPicture  classification 
task.  However,  if  the  main  interest  in  the  study  involves  a  cognitive  process  whose  onset 
is  not  perfectly  tied  to  the  stimulus  timing,  HPMs  may  be  quite  relevant. 

Also  of  note  is  that  the  trial  structure  of  the  sentence-picture  fMRI  data  provides  im¬ 
portant  conditional  independencies  that  allow  for  more  compact  representation  of  configu¬ 
rations.  Since  trials  are  long  enough  to  allow  brain  activity  to  return  to  a  baseline  after  the 
mental  task,  there  are  timepoints  twoard  the  end  of  each  trial  where  no  processes  are  ac¬ 
tive.  This  allows  configurations  to  be  specified  at  the  trial-level,  independently  of  the  set  of 
configurations  for  the  neighboring  trials;  that  is,  the  trial- specific  latent  indicator  variables 
are  conditionally  indpendent  given  the  process  parameters.  If  trials  were  not  separated  by 
these  inactive  periods,  we  would  have  to  consider  configurations  spanning  the  entire  exper¬ 
iment  instead  of  individual  trials,  which  would  likely  consist  of  all  possible  combinations 
of  the  trial-specific  configurations  and  be  computationally  intractable.  For  fMRI  analysts, 
this  means  that  HPMs  are  better  suited  to  event-related  designs  (like  the  sentence-picture 
experiment)  than  block  designs  (in  which  many  stimuli  of  the  same  type  are  presented  in 
quick  succession  for  a  period  of  time).  In  other  domains,  HPM  users  are  advised  to  exploit 
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similar  structure  in  the  time  series  of  interest  to  avoid  excess  complexity. 

Finally,  when  using  HPMs  it  is  wise  to  consider  whether  the  processes  of  interest  are 
identifiable.  Processes  are  identifiable  if  there  is  sufficient  variability  in  their  timing  over 
the  course  of  the  experiment  to  permit  a  unique  solution  to  their  deconvolution.  Alterna¬ 
tively,  processes  are  unidentifiable  if  there  are  multiple  settings  of  their  parameters  that 
produce  the  same  log-likelihood  of  the  data.  For  instance,  an  obviously  problematic  case 
is  when  two  processes  of  interest  always  overlap  completely.  Unfortunately,  most  cases 
are  not  as  intuitive.  The  two-process  HPMs  in  the  sentence-picture  experiment  are  iden¬ 
tifiable  because  of  the  stimuli  are  presented  in  both  possible  orders.  However,  when  the 
third  processes  is  added,  the  situation  is  less  clear.  If  the  timing  of  the  third  process  is 
always  the  same,  then  the  end  of  the  ViewPicture  and  ReadSentence  response  signatures 
and  the  beginning  of  the  Decide  response  signature  become  unidentifiable.  If  the  timing 
of  the  third  process  varies  sufficiently  across  trials,  we  may  have  enough  constraints  to 
render  the  processes  identifiable.  In  general,  we  can  improve  identifiability  by  varying  the 
likely  orders  and  timings  of  the  processes  of  interest  in  the  experiment. 

Another  way  to  guarantee  identifiability  is  to  experimentally  isolate  each  process  of 
interest.  For  example,  we  could  augment  the  sentence-picture  experiment  with  several 
preceding  trials  of  single  stimuli.  However,  this  approach  becomes  less  attractive  when 
we  consider  the  linearity  assumption.  While  assuming  that  overlapping  hemodynamic 
responses  sum  linearly  is  common  and  somewhat  reasonable,  deviations  from  linearity  are 
also  well- studied.  The  ViewPicture  process  by  itself  may  be  significantly  different  from 
the  ViewPicture  process  in  the  context  of  the  mental  task  at  hand.  For  this  reason,  we 
recommend  varying  the  context,  order,  and  timing  of  the  processes  instead  of  isolating 
them  to  achieve  identifiability. 


6.3  Contributions 

In  section  1.3,  we  overviewed  research  relevant  to  Hidden  Process  Models.  As  discussed 
there,  most  efforts  to  estimate  hemodynamic  responses  assume  that  the  timings  of  their 
generating  cognitive  processes  are  known.  Classification  techniques  also  make  this  as¬ 
sumption.  HPMs  provide  a  framework  for  fMRI  analysis  that  relaxes  the  assumption  of 
fixed,  known  process  timings,  and  allows  both  classification  and  HRF  estimation. 

One  technique  for  HRF  estimation  that  we  discussed  above  is  the  General  Linear 
Model  (Dale  and  Buckner  [1997],  Dale  [1999]).  HPMs  can  be  viewed  as  a  generalization 
of  the  GLM.  Instead  of  using  a  fixed  design  matrix  containing  the  known  experimental 
design,  HPMs  use  a  design  matrix  containing  several  competing  explanations  for  some 
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time  points.  The  EM  algorithm  discussed  in  Chapter  2  then  iterates  between  determining 
which  explanation  is  correct,  and  estimating  the  parameters  of  the  model. 

Above,  we  also  mentioned  SPM  (Friston  [2003])  as  a  widely  used  fMRI  analysis  tech¬ 
nique,  and  we  return  to  it  here  to  clarify  the  difference  between  that  approach  and  HPMs. 
In  particular,  SPM  generates  maps  that  indicate  the  extent  to  which  particular  voxels  have 
activation  that  is  statistically  significant  for  various  experimental  conditions.  The  HPM 
maps  in  Chapter  5  show  the  learned  parameters  for  the  signatures  of  various  processes. 
They  do  not  make  claims  about  statistical  significance.  We  believe  these  maps  are  still 
helpful,  but  they  should  not  be  interpreted  in  the  same  way  that  SPM  maps  are.  In  gen¬ 
eral,  SPM  is  a  massively  univariate  technique  that  assumes  known  process  timing  to  make 
statements  about  the  statistical  significance  of  voxel  activity  in  various  conditions.  In  con¬ 
trast,  HPMs  are  a  multivariate  technique  that  learns  parameters  of  processes  with  uncertain 
timing. 

Overall,  we  believe  that  the  biggest  impact  of  HPMs  is  to  provide  a  framework  for 
studying  cognitive  processes  with  unknown  onset.  Conveniently,  HPMs  also  allow  clas¬ 
sification,  although  as  we  have  discussed,  they  are  not  optimized  for  that  task.  Our  work 
on  regularizing  HPMs  and  parameterizing  HPMs  with  basis  functions  is  a  starting  point 
for  providing  a  variety  of  modeling  assumptions  about  the  process  response  signatures  of 
interest,  to  accommodate  researchers  who  may  wish  to  model  the  hemodynamic  response 
function  in  different  ways. 


6.4  Future  Directions 


We  see  four  major  avenues  for  future  work  on  Hidden  Process  Models:  work  on  the  repre¬ 
sentation  of  the  hypothesis  space,  work  on  different  parameterizations  of  HPMs,  work  on 
automatically  discovering  the  number  of  processes  in  an  HPM,  and  work  applying  HPMs 
to  more  studies. 

One  current  limitation  of  HPMs  is  the  need  to  enumerate  the  hypothesis  space  of  the 
model  in  the  set  of  configurations.  This  is  inelegant  at  best,  and  intractable  at  worst.  Chap¬ 
ter  4  outlines  a  possible  alternative  to  configurations  using  Dynamic  Bayesian  Networks. 
It  remains  to  be  seen  whether  the  variational  EM  and/or  Gibbs  sampling  algorithms  for 
HPMs  described  in  Chapter  4  can  cope  with  data  as  high-dimensional,  sparse,  and  noisy 
as  fMRI.  Further  work  on  these  approaches  would  be  valuable  since  it  could  save  work  for 
HPM  users  (by  not  needing  to  specify  the  list  of  configurations  for  each  trial)  and  allow 
the  learned  models  to  be  independent  of  the  sets  of  configurations. 
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A  second  direction  for  improving  HPMs  is  to  expand  the  choices  for  parameterizing 
the  model.  For  instance,  it  would  be  interesting  to  introduce  a  parameter  for  the  duration 
of  each  process,  to  be  learned  instead  of  fixed  in  advance.  Experimenting  with  alternative 
noise  models  may  also  be  fruitful,  since  we  have  seen  that  the  Gaussian  Naive  Bayes  noise 
model  outperforms  similar  HPMs  in  classification. 

Thirdly,  it  would  be  very  interesting  to  develop  algorithms  for  automatically  discov¬ 
ering  the  optimal  number  of  processes  in  an  HPM.  The  current  method  of  using  cross- 
validated  data  log-likelihood  to  decide  the  correct  number  of  processes  is  reasonable  for 
choosing  among  a  small  set  of  possible  numbers  of  processes,  but  becomes  impractical 
when  there  is  more  uncertainty  about  how  many  processes  to  model. 

Finally,  there  is  still  work  to  be  done  using  the  HPM  framework  developed  so  far. 
HPMs  could  be  applied  to  more  datasets,  within  the  fMRI  domain  and  outside  of  it.  For 
scenarios  with  appropriate  prior  knowledge,  it  would  be  interesting  to  experiment  with 
more  regularization  penalties,  including  spatial  sparsity  and  spatial  priors.  Since  HPMs 
parameterized  with  basis  functions  have  performed  well  so  far,  it  may  be  worth  investigat¬ 
ing  different  numbers  of  basis  functions,  different  types  of  bases,  and  regularization  for 
spatial  smoothness  on  the  basis  function  weights. 


6.5  Summary 

In  conclusion,  recall  the  thesis  presented  in  Section  1.4:  The  thesis  of  the  research  reported 
here  is  that  we  can  develop  machine  learning  techniques  for  time  series  data  that  can 
simultaneously  estimate  parameters  of  temporal  signatures  and  the  onsets  of  the  latent 
processes  that  generate  them.  These  techniques  can  be  improved  by  incorporating  domain 
knowledge,  and  they  can  be  used  to  express,  learn,  and  compare  competing  models  of  the 
processes  generating  the  data.  This  thesis  is  motivated  by,  and  will  be  explored  within  the 
fMRI  domain. 

In  support  of  this  thesis,  we  introduced  Hidden  Process  Models,  with  the  formalism 
and  algorithms  in  Chapter  2.  We  elaborated  on  this  basic  version  of  HPMs  in  Chapters 
3  and  4.  Then  in  Chapter  5,  we  provided  results  on  synthetic  fMRI  data  (for  which  we 
knew  ground  truth)  demonstrating  that  HPMs  can  recover  the  parameters  of  the  processes 
even  in  the  presence  of  uncertain  timing.  We  also  used  synthetic  data  to  show  that  HPMs 
can  use  held-out  data  log-likelihood  to  select  the  model  with  the  correct  number  of  latent 
processes.  We  showed  that  incorporating  domain  knowledge  about  the  temporal  smooth¬ 
ness  of  the  process  response  signatures  in  fMRI,  via  regularization  and  basis  functions, 
improved  the  performance  of  HPMs  on  real  fMRI  data.  We  also  provided  a  comparison 
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of  several  models  of  the  real  fMRI  study,  and  we  showed  how  the  learned  HPMs  can  be 
visualized  to  aid  interpretation  of  the  processes.  Finally,  the  current  chapter  has  discussed 
the  experimental  results,  advice  for  HPM  users,  contributions  of  HPMs  to  the  field,  and 
future  directions  of  research. 

In  summary,  Hidden  Process  Models  are,  to  the  best  of  our  knowledge,  the  first  at¬ 
tempt  to  simultaneously  recover  the  timings  and  hemodynamic  responses  of  the  cognitive 
processes  underlying  an  fMRI  experiment.  The  evidence  to  date  suggests  that  HPMs  have 
promise  for  fMRI,  but  they  are  still  in  their  infancy.  We  hope  that  progress  on  the  research 
directions  outlined  above  will  make  HPMs  increasingly  useful  to  cognitive  scientists. 
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Appendix  A 
Notation 


This  section  provides  tables  of  notation  to  aid  the  reader  in  understanding  the  formalism 
and  algorithms  for  HPMs  (Sections  2.1  and  2.2,  respectively).  Table  A.l  refers  primarily 
to  the  notation  for  the  formalism,  and  Table  A. 2  is  primarily  for  the  algorithms. 
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Symbol 

Meaning 

Y 

Data  matrix. 

TxV 

t 

Time  point  in  data. 

{1...T} 

V 

Voxel. 

{1...V} 

71 

Process. 

{1.  ■  |n|} 

d{  7r) 

The  duration  of  process  n. 

W(tt) 

The  process  response  signature  matrix  for  process  n. 

d(n)  x  V 

0(vr) 

Vector  of  multinomial  parameters  over  fi(7r). 

Same  length  as  f2(7r) 

n(7r) 

The  vector  of  offsets  for  process  n. 

Integers 

n 

The  set  of  all  processes. 

i 

Process  instance. 

tt(Z) 

The  process  ID  for  i. 

A(<) 

The  timing  landmark  for  i. 

0(*) 

The  offset  for  i. 

c 

Configuration. 

{1...C} 

C 

The  set  of  all  configurations. 

The  variance  of  voxel  v. 

*(■) 

Returns  1  if  and  only  if  its  argument  is  true. 

M  c) 

The  mean  of  the  distribution  over  Y  under  c. 

TxV 

Xc 

The  binary  design  matrix  for  configuration  c. 

TxD 

D 

The  sum  of  the  durations  of  all  processes  d{i r)). 

W 

The  vertical  concatenation  of  W (7r)V7r. 

DxV 

/i(c) 

The  mean  of  the  Gaussian  distribution  over  Y  under  c. 

TV  x  1 

n 

Trial. 

{1...JV} 

Cn 

Configuration  within  trial  n. 

{1  ...cn} 

r 

The  set  of  all  configurations  for  trial  n. 

Table  A.l:  Notation  used  in  the  formalism  for  HPMs,  roughly  in  order  of  introduction. 
The  last  column  indicates  the  size  and/or  domain  of  the  variable,  where  appropriate. 
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Symbol 

Meaning 

T 

The  set  of  HPM  parameters  to  be  learned  from  data. 

xc 

The  vertical  concatenation  of  Xnc  for  all  trials  n. 

VT  x  VD 

Y 

Data  vector. 

TV  x  1 

Yn 

Data  vector  for  a  trial  n. 

Tnv  X  1 

W 

The  vertical  concatenation  of  W (7r)V7r. 

DV  x  1 

a2 

The  vertical  concatenation  of  a2  for  all  trials. 

VT  x  1 

E2 

The  covariance  matrix.  E  =  diag(cr2) 

VT  x  VT 

Z 

Binary  indicator  vector  for  the  correct  configuration. 

Cxi,  £c2fc  =  l 

ZC 

Binary  indicator  variable  for  configuration  c. 

Zc  G  {0, 1} 

M 

Length  of  the  vertical  concatenation  of  the  design  matrices  for 
all  configurations  in  all  trials.  M  —  V  TnCn 

X 

The  vertical  concatenation  of  Xn  for  all  trials  n. 

M  x  VD 

Xn 

The  vertical  concatenation  of  Xc  for  all  configurations  in  trial  n. 

VTnCn  x  VD 

Xnc 

The  block-diagonal  of  Xnc  is  Xnc  repeated  V  times. 

VTn  x  VD 

Y 

Vertical  concatenation  Yn  for  all  n. 

M  x  1 

Yn 

Vertical  concatenation  of  Yn  repeated  for  Cn  times. 

VTnC'n  x  1 

Z 

Expansion  of  Z.  Znc  for  each  element  m  corresponding  to  cn. 

M  x  1 

z 

Z  =  diag(Z) 

s 

Covariance  matrix  for  Y. 

M  x  M 

a2 

Diagonal  of  E.  a2  for  each  element  m  corresponding  to  voxel  v. 

M  x  1 

T 

Weights  for  the  weighted  least  squares  solution.  T  =  'E~1E[Z\ 

7 

^ nc 

Binary  indicator  variable  for  configuration  c  in  trial  n. 

Znc  ^  {0,  1} 

Table  A. 2:  Notation  used  in  descriptions  of  the  HPM  algorithms.  The  last  column  indicates 
the  size  and/or  domain  of  the  variable,  where  appropriate. 
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Appendix  B 

Derivation  of  the  M  step  Update  for  W 


The  M  step  estimate  of  W  under  uncertainty  about  the  true  configuration  of  process  in¬ 
stances  can  be  understood  in  terms  of  an  extension  to  the  General  Linear  Model  (GLM) 
presented  in  2.2.2.  In  that  approach,  we  created  a  design  matrix  Xc  from  the  known  con¬ 
figuration  c.  Here,  we  consider  the  design  matrix  X  reflecting  all  possible  configurations. 
The  construction  of  X  and  the  related  variables  Y,  Z,  and  S  the  are  described  in  Section 
2.2.3. 

We  begin  with  the  expected  log-likelihood  of  the  complete  data,  including  the  observed 
data  Y  and  the  latent  indicator  variable  X : 


Q(T,^old) 


£p(z|y,W1iiP(Y,  ^\^)\ 

Pp(z|Y,*°id)[lnP(Y|Z,  T)  +  In  P(Z|'T)] 


E 


P(Z|Y,^°ld) 


In 


(: 2Y 


M 

2 


-exp 


y  -  xn-’iii-. 


E 


P(Z|Y,^°ld) 


N 


■■dim  p(c\^y 


n  C&Cn 


-ln((2ir)T|E|i)  -  ^E[Z)\\Y  -  XW\\l-,  + 

N 

XX  E[Znc\  ln(P(c|'T)) 

n  cSC„ 


(B.l) 


We  can  maximize  the  expected  log-likelihood  of  the  complete  data  Q  by  solving  for 
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the  W  that  minimizes  the  objective  function: 


fo(W)  =  E[Z]\\Y-±W\\U 

=  {Y-±W)'1h-lE[Z](Y-XW) 


(B.2) 


This  is  a  weighted  least  squares  problem  with  weight  matrix  T 
tion  to  which  is  given  by: 

W  < —  (x'Txj  X'TT 


X  1E[Z],  the  solu- 
(B.3) 


We  arrive  at  this  solution  by  setting  the  derivative  of  fo(W)  equal  to  zero  and  solving 
for  W  as  follows: 


dfo 

dW 

0 

X'YXfB 

W 


-4-(y  -  x.wyr(Y  -  xw) 
dW 

—  ( Y'YY  -  2Y'TXW  +  W'XTXW 
dW  V 

-2X'rf  +  2X'YX1B 
X'YY 

^X'Yxj  X'Tf 


(B.4) 
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Appendix  C 

Tutorial  on  Factorial  Hidden  Markov 
Models 


In  this  section,  we  give  some  background  on  factorial  Hidden  Markov  Models  (fHMMs) 
which  may  be  helpful  in  understanding  the  extension  to  fHMMs  that  corresponds  to  Hid¬ 
den  Process  Models,  described  in  Chapter  4.  We  describe  the  model,  two  exact  inference 
algorithms,  and  two  variational  inference  algorithms  for  fHMMs.  None  of  this  is  original 
work;  instead,  it  is  intented  as  a  tutorial  for  the  unfamiliar  reader.  All  of  the  material  on 
fHMMs  follows  Ghahramani  and  Jordan  [1997]  closely.  The  only  original  work  in  this 
Appendix  is  in  Section  C.1.3,  which  describes  a  small  modification  to  the  second  fHMM 
exact  inference  algorithm  for  HPMs. 

Factorial  Hidden  Markov  Models  are  a  generalization  of  Hidden  Markov  Models  (HMMs) 
(Rabiner  [1989])  in  which  the  hidden  state  is  factored  into  multiple  state  variables  that  are 
conditionally  independent  of  one  another  given  the  observed  data.  A  graphical  depiction 
of  the  model  is  given  in  Figure  C.l.  Let  (S)}  be  a  sequence  of  hidden  states,  using  the 
factored  representation  with  M  chains  so  that  St  =  { S[ 1  ,  ,S'f2) , . . . ,  where  S[m ]  is  a 

K  x  1  binary  vector  containing  a  single  1  in  the  position  of  the  state  value.  In  general  each 
chain  can  take  on  K(rn>  values  but  here  for  simplicity,  we  assume  that  K(m>  =  K,  Vm.  Let 
{Yt}  be  a  sequence  of  observations,  and  let  T  represent  the  parameters  of  the  model.  The 
factorial  HMM  defines  a  probability  distribution  as  follows: 


M  T 

p({s„y,}|#)  =  n  [^(■s!m)|i')]  p(n|s„«)II 

m= 1  t= 2 


M 


-pwisi.*) 

(C.l) 


m=l 
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Figure  C.l:  (a)  Graphical  representation  of  a  Hidden  Markov  Model.  St  is  the  value  of 
the  hidden  state  at  time  t ;  Yt  is  the  observed  data  at  time  t.  (b)  Graphical  representation  of 
a  factorial  Hidden  Markov  Model  with  M  —  3  hidden  Markov  chains.  S\'"!  is  the  value 
of  the  hidden  state  of  the  mth  chain  at  time  t.  This  figure  is  from  Ghahramani  and  Jordan 
[1997], 


We  will  work  with  the  following  distributions  for  the  components  of  P({Si.  V'/}  |  'I-'  ): 

K 

p(s[m)\^)  =  n  (4m)) i,fc  (c.2) 

k=  1 

\ogP(S[m)\^)  =  Sjm)log(vr(m))  (C.3) 

where  is  a  K  x  1  vector  containing  the  initial  probabilites  of  each  state  value  for  chain 
m. 

q(m) 

(C.4) 

\ogP(Slm>\Stl<l)  =  (SlTi)'  Iog(A(m,)5f(m)  (C.5) 

where  A(rn)  is  a  I\  x  I\  matrix  containing  the  state  transition  probabilities  for  chain  m 


P(S,(’")|S1(-1),  *)  = 


K 


K 


n  nn 

k= 1  \j= 1 


iM 


n(m) 

l,j 
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(A 


if  is  P(St(m)  =  j\sjf 


*))• 


iogp(yt|^,^) 


|C|3(2tt)t  x 

M  M 

m=l  m=l 

log(|C|^(27r)?)- 


M 


M 


~{Yt  -  W^S^yC^iYt  -  W(m)S{t 

m= 1  m=l 


»n)x 


(C.7) 


(C.8) 


where  Yt  is  a  D  x  1  vector  of  observations,  C  is  a  D  x  D  covariance  matrix,  and 
is  a  D  x  K  matrix  where  W\ f  is  the  contribution  to  the  %Jh  dimension  of  Yt  from 
chain  m  when  it  takes  on  its  jth  value.  This  equation  says  that  the  observed  data  is 
governed  by  a  multivariate  Gaussian  distribution  whose  mean  is  the  sum  of  a  contri¬ 
bution  from  each  chain  that  depends  on  the  chain’s  value.  Using  these  distributions, 
^  =  {{TT^n\A^n\W^}Vm,C}. 

For  the  inference  problem,  we  are  interested  in  the  probability  distribution  over  the 
hidden  state  sequence  conditioned  on  the  observation  sequence,  which  is: 


p({st,Ytm 

pwm 


=  —P{{St,Yt}\^) 

Zp 


1 

~Z~p 


m=  1 


t= 2 


n  [p(s!m>ii')]  p(uisi.  t)  n  n  «o]  pms,,  «o 

(C. 


m=  1 


ill 


iogp({st}|{yt},*) 


M  T 

=  -  log (Zp)  +  ]T  logP(S[m)\^)  +  ]T  log  P(Yt\St,  tt)  + 

m=  1  t=  1 

M  T 

XE^^VK,*) 

ra=l  £=2 

M  M  T 

-  -  mzp) + x  -si"”  i°g(^(m>) + x  X(s«-.))'  iog(^<m,)s«(m) + 

ra=l  m=  1  £=2 


1  T  T  M 

log(|C'|i(27T)T)--^  y/C'-'F*  +  EE  YjC~1W('m)slm)  - 

t= 1  t=l  m= 1 

T  M  M 

2  E  E  E(5,<”))'(M/(n))W(m)St<’") 

t=l  m=l  n=l 

(C.10) 


C.l  Exact  Inference 

C.1.1  Naive  Inference 

Since  a  factorial  HMM  can  be  expressed  as  an  HMM,  we  can  use  the  forward-backward 
algorithm  (Rabiner  [1989])  to  do  inference  in  fHMMs.  For  a  fHMM  with  M  state  vari¬ 
ables,  each  of  which  takes  on  K  values,  the  corresponding  HMM  has  KM  states.  The 
forward-backward  algorithm  proceeds  by  computing  at(i)  =  P(Y\.,.  St  —  /|T)  for  every 
time  t  and  state  value  i  on  the  forward  pass,  and  (3t{i)  =  P(Yt+i:T\St  =  i,  T)  for  every 
time  t  and  state  value  i  on  the  backward  pass,  where  Ya:b  indicates  the  sequence  from  a 
to  b.  The  forward-backward  algorithm  can  be  implemented  with  dynamic  programming. 
The  forward  recursion  is  given  by: 

e*i(i)  =  P(5,1  =  i)P(yi|S,1  =  i) 

1  <i<KM  (C.ll) 


K" 


P(Yt+l\Sl+1=j)J2nSt+i=j\S,  =  i)a,(i) 


i= 1 


1  <  t  <  T  -  1, 1  <  j  <  K 


M 


(C.12) 
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The  backward  recursion  is  given  by: 


Pr(i)  —  1 

1  <  i  <  Km 


(C.13) 


km 

m  =  J]P«+i|St+i=i)F(S(+1=i|S,  =  !)A+1(i) 

3= 1 

t  =  [T  —  1,  T  —  2, . . . ,  1],  1  <  i  <  Km  (C.14) 

The  complexity  of  the  forward-backward  algorithm  is  0(TK2M).  The  TKM  in  the 
complexity  result  occurs  because  we  do  a  computation  for  every  time  step  and  every  state 
value,  and  the  second  K M  occurs  because  the  sum  over  P(St+ 1  =  j\St  =  i)  also  has  KM 
terms. 

C.1.2  An  Improvement 

We  can  improve  on  the  naive  inference  algorithm  presented  above  by  taking  advantage  of 
the  conditional  independence  assumption  between  Markov  chains  in  fHMMs.  In  this  case, 
the  transition  probability  matrix  can  be  factored  into  M  K  x  K  matrices,  one  for  each 
state  variable  M,  instead  of  the  KM  x  KM  matrix  used  above.  For  the  forward  recursion, 
this  gives: 

km  m  k 

y.  p<s>+i =j\s,=i)= «.(<>  n  e  p(-s‘$ = ^siira) = i<m>)  <c-i5> 

1=1  m=  1  i(m) 

There  are  K  terms  in  each  sum,  and  M  sums  overall,  resulting  in  KM  terms  instead  of 
KM .  This  improves  our  complexity  result  from  0(TK2M)  to  0(TKMKm)  =  0(TMKm+1). 

C.1.3  Exact  Inference  for  HPMs 

Here,  we  attempt  to  modify  the  exact  inference  algorithms  presented  above  for  HPMs. 

Our  hope  is  that  the  additional  structure  provided  by  the  HPM  assumptions  will  improve 
the  complexity  of  the  algorithms. 

The  first  improvement  we  can  make  on  the  fHMM  inference  algorithm  using  the  HPM 
structure  is  to  remove  the  +1  from  the  time  complexity  0(TMKm+1).  We  do  this  by 
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noticing  that  for  any  given  state  St,  the  sum  over  probabilities  of  transitioning  from  any  of 
the  Km  states  to  S',  can  be  simplified.  If  S',  =  0,  then  St~  i  could  only  have  been  0  or  d, 
so  the  sum  is  P(St  =  0 1 S'*— i  =  0)  +  P(St  =  0|Sfe_i  =  d)  =  (1  —  p)  +  1.  If  St  >  0,  the 
state  variable  is  deterministically  counting,  so  the  only  non-zero  term  in  the  sum  over  all 
possible  predecessors  is  P(St  =  x|S't_i  =  x  —  1)  =  1.  So  we  have: 

M  K 

=  (i-P)+i,j(m)=o 

m= 1  i 

=  l,/m)>0  (C.16) 

Decomposing  this  sum  decouples  it  from  the  arity  of  the  state  variable,  K,  which  means  a 
constant  amount  of  computation  for  each  state  variable,  so  we  have  0(TM KM). 

The  second  insight  for  HPMs  as  fHMMs  is  that  if  all  the  state  variables  ]lm>  for  for 
state  value  j  are  non-zero,  then  we  have: 

M  K  M 

n  v  p(.ss  =  i(ro)is,<m) = i) = n 1 = i  (c-i7) 

m=  1  i  m= 1 

For  these  state  values,  the  forward  recursion  simplifies  to: 

M 

<*t+iU)  =  P(Yt+i\St+i  =  J)  II  -  !)  (C-18) 

m= 1 


Finally,  note  that  there  are  2M  —  1  state  values  with  at  least  one  zero  for  one  of  the  state 
variable  values,  regardless  of  the  arity  of  the  state  variables. 

Unfortunately  this  second  insight  does  not  further  improve  our  complexity,  since  even 
though  we  can  remove  the  sum  completely,  we  still  have  a  product  of  M  terms  for  each  of 
Km  state  variables. 


C.2  Variational  Inference 

Since  the  exact  inference  algorithms  presented  so  far  all  have  exponential  time  complex¬ 
ities,  we  now  consider  approximate  inference  algorithms  for  fHMMs.  We  will  begin  by 
explaining  the  theory  behind  variational  inference  in  general,  and  then  give  two  examples 
of  variational  inference  algorithms  for  fHMMs. 
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C.2.1  The  General  Framework  for  Variational  Inference 


Here  we  describe  the  basic  idea  behind  variational  approximations  for  inference  in  graph¬ 
ical  models.  This  section  is  based  on  the  descriptions  given  in  Jordan  et  al.  [1998]  and 
Ghahramani  and  Jordan  [1997]. 

Let  us  call  the  set  of  hidden  variables  in  our  model  H  and  the  set  of  observed  variables, 
or  evidence  E.  The  inference  problem  is  to  compute  the  probability  distribution  over  the 
hidden  variables  given  the  evidence: 


P(H\E) 


P(H,E) 
P(E) 
PjHJS) 
T.H  P(H .  E) 


(C.19) 


When  P (H \ E)  is  intractable  to  compute  directly,  we  can  use  variational  methods  to  ap¬ 
proximate  it  by  a  tractable  distribution  Q(H\E).  Here  we  show  that  any  distribution 
Q(H\E)  provides  a  lower  bound  on  the  log  likelihood  of  P(E ): 


log  P(E)  =  log  E  P(H,E ) 

-  *E«*i*® gj 


H 


J2Q(H\E)log 


H 


Q{H\E ) 
P(H\E)P(E) 
Q(H\E ) 


£  Q(H\E)(\og  +  log  P(E)) 


H 


Q(H\E) 


£  Q(H \E)  log  PPP  +  £  Q(H\E)  log  P(E) 


H 


£  Q(H\E)  log 


H 


Q{H\E) 

P{H\E ) 

Q(H\E) 


H 


+  log  P(E) 


(C.20) 


(C.21) 

(C.22) 


where  in  Equation  C.20  we  have  used  Jensen’s  inequality  (Wasserman  [2004]).  Subtract¬ 
ing  log  P(E)  from  both  sides,  we  see  that  the  difference  between  the  left  and  right  sides 


115 


is: 


0  > 

o  : 


Y,QiH\E)  log 

H 

Y,Q(H\E)\og 

H 


P(H\E) 

Q{H\E) 

Q(H\E ) 
P{H\E) 


(C.23) 


This  last  line  is  the  Kullback-Leibler  (KL)  divergence  (Wasserman  [2004])  between  Q 
and  P,  KL(Q\\P).  This  means  that  the  KL  divergence  between  an  arbitrary  distribution 
Q(H\E)  and  the  true  distribution  P(H\E)  is  a  lower  bound  on  the  log  likelihood  of  the 
observed  evidence. 

Variational  inference  methods  proceed  from  this  result  by  choosing  a  family  of  dis¬ 
tributions  Q(H\E)  defined  by  a  set  of  conditional  independence  relationships  among  the 
variables  in  H,  and  then  choosing  a  member  of  that  family,  which  corresponds  to  choos¬ 
ing  particular  settings  for  the  parameters  of  the  family  Q.  These  parameters  are  called 
variational  parameters.  To  make  the  bound  on  the  log  likelihood  as  tight  as  possible,  the 
variational  parameters  are  chosen  to  minimize  the  KL  divergence  between  Q  and  P  (e.g. 
by  setting  the  derivative  of  Equation  C.23  to  zero  and  solving  for  the  parameters  of  Q). 

Having  chosen  and  parameterized  Q,  we  can  approximate  P(H\E)  with  0(11  \E) 
and/or  compute  expected  sufficient  statistics  under  Q(H \E)  to  be  used  in  learning  model 
parameters  in  an  EM  algorithm. 


C.2.2  Mean  Field  Approximation  for  Factorial  HMMs 

In  this  section,  we  describe  the  mean  field  approximation  for  factorial  HMMs.  Both  the 
method  and  the  notation  are  based  on  Ghahramani  and  Jordan  [1997]. 

The  first  step  in  developing  a  variational  method  is  to  choose  the  family  of  distributions 
to  be  used  for  Q;  that  is,  to  choose  the  set  of  conditional  independence  relationships  that 
hold  for  Q.  The  simplest  choice  of  Q  for  fHMMs  is  to  decouple  all  of  the  state  variables, 
breaking  the  links  of  each  chain  going  forward  in  time  and  the  links  from  each  chain  to  the 
observed  variables  at  every  time  point.  This  completely  factorized  approximation  is  called 
the  mean  field  approximation,  and  is  depicted  in  Figure  C.2. 

The  family  of  distributions  represented  by  this  structure  is  given  by: 

T  M 

o({s,}i») = n  n  o(4’")  i»!m>)  (c.24) 

f=l  m= 1 
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Figure  C.2:  Graphical  representation  of  the  approximating  family  of  distributions  for  the 
mean  field  variational  approximation  for  factorial  HMMs.  S\rn'>  represents  the  state  of  the 
mth  chain  at  time  t.  In  the  mean  field  approximation,  all  state  variables  are  decoupled. 
This  figure  is  from  Ghahramani  and  Jordan  [1997]. 


where  6  =  }  are  the  variational  parameters,  also  represented  as  K  x  1  vectors  defining 

a  multinomial  distribution  on  s[m\  in  which  the  kth  element  contains  the  probability  that 
Sim)  is  in  state  k : 


I\ 


Q(s!m> \e{r])  =  m 


(C.25) 


k= 1 


Adding  a  normalization  term  Zq  to  ensure  that  9^  =  1,  we  have: 

T  M  K 


o({s,}i»)  =  ^nnn<« 

t=  1  m=  1  k= 1 
T 

log<3({S,}|9)  =  -  log(ZQ)  +  log 0 


iWisS’ 

't.k  ) 


T  M 


t=  1  m= 1 


(C.26) 


Given  this  family  of  distributions,  we  wish  to  choose  the  variational  parameters  6  that 
minimize  the  KL-divergence  between  Q{{Si}\0)  and  the  true  /;;({-5'/}| {F/}.  T),  given  by: 


^qiW  =  Eq(WI»)]og  ggg; 

mtf 


(C.27) 
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KL(Q\\P) 

KL(Q\\P )  = 


KL(Q\\P) 


E iogQ({S't}|0)  -  E  «({$}!*) iogP({5t}|{yt}, *) 

{St}  {St} 

(C.28) 


(t  m  \ 

-  log(Zg)  +  £  £(S,('”))' log  +  MZp )  - 

t=l  m=l  / 

/  M  M  T  \ 

£  s!’”>  log(ir(m))  +  £  £(SS»)'  log(^(”‘»)S,l’")  - 


£0({si}|«) 

{St} 


\m=  1 


77i=l  £=2 

T 


£q({S,}|»)  log(|C[»(2ir)T)  -  -£KC-1K« 


{St} 


V 


t= 1 


T  M 


£«(«}  w££ 


{St} 


t=l  m=l 

T  M  M 


£«({«<}  i«)5£££(y> 

{St}  ^ 


m) 


t=l  m=l  n=l 


(C.29) 


T  M 

-  log (ZQ)  +  £  £(Bg[5,<m)])'ioge,(m)  +  log(Zp)  - 

t=l  77i=l 

M  M  T 

£  B«[s!m)]  logl^r*”1)  -  £  £tr{B0[S<"*)(S<l"l)1  log(^(m))}  - 

771=1  771=1  t=2 

log(|C|!(2ir)T)  +  j£l','C_1F,  - 
^  t=l 

T  M 

EE 

i=l  m=l 
1  M  T 

^EE  lr{dia^{JBQ[5t(m)]}(iy{m))/C'-1iy{m)}  + 

771=1  £=1 

M  T 

^EEE  tr{EQ[s\m\s[n))\W^))'C-1W^m)} 

m= 1  n^in  t= 1 

(C.30) 
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where  tr{}  is  takes  the  trace  of  a  matrix  and  diag{}  puts  its  argument  on  the  diagonal  of 
an  otherwise-zero  matrix. 


In  this  case,  the  expected  values  of  the  state  variables  are  simple  because  Q  is  fully 
factorized: 


B0[S<m)]  =  9,(m) 

(C.31) 

SalsJ-’tsS)']  = 

(C.32) 

EQis(rHsln)y\  =  <tW)' 

(C.33) 

We  now  plug  these  expected  values  back  into  the  KL  divergence  and  take  the  derivative 
with  respect  to  the  variational  parameters: 


dKL(Q\\P) 

dO? 


— iog(zQ)  + 1  +  log^  -  {w^yc^Yr  + 1  a  {(w^yc-'w®}  + 
dOr  ^ 

M 

J(lf('))'C-1f(n)^)  -  logA^+i  -  (^Ij'log^® 

n^l 

(C.34) 


where  A{}  returns  a  vector  containing  the  diagonal  of  its  argument. 

Setting  this  derivative  to  zero  and  solving  for  9?  gives: 

0  = - \og(ZQ)  +  1 +  \og 9^ -(W^),C~1YT  + l A{(W^)'C-1W^}  + 

d9r  ^ 

M 

^(lk(,))'C-1lf(n)^)  -  logA^^+i  -  (flflj'log  Aw 

n^l 

(C.35) 


log 


~4y  log (ZQ)  - 1  +  (w^yc-x  -  l-A{{w®)'c-lwV} 
dO  t 


M 

+  logA(m)4+i  +  (^Ij'log  A(l) 

n^l 


(C.36) 
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0®  =  exp (c  +  (wWyC-'Yr  -  ^A{( W^)' C~lW®}  - 

M 

^(W^'C-'W^  +  log  A<-m)6%  +  (4-i)'log^(0) 

n^l 

M 

=  exp(c  +  (W^yC-^Yr  -  ^(iyw)'f)  - 

n-=/=l 

{(w^yc-'w^)  +  iogA(m^?|i  +  (e^yiogA^) 

=  exp  (^W^yc-1?^  -  ^A {{W^yc-'W^}  +  log,4(m^{i  +  (O^y  log 

(C.37) 

where  Yr(l)  =  YT  -  Y^i  (  W(n>yO,ln>.  the  residual  error  in  Yr  given  the  predictions  from 
all  the  state  variables  except  for  l.  The  constant  c  has  been  removed,  and  in  its  place  we 
renormalize  the  vector  9?  after  every  computation. 

Equation  C.37  defines  a  set  of  fixed  point  equations  for  the  variational  parameters. 
We  can  iterate  on  these  equations  until  the  KL  divergence  converges.  Notice  that  the 
terms  in  the  fixed  point  equation  for  a  particular  9(tm'1  involve  the  other  parameters  in  the 
Markov  blanket  of  0fin>  (the  variational  parameters  for  chains  n  y  m  at  time  t,  and  the 
variational  parameters  for  chain  m  at  times  t  —  1  and  t  +  1).  Therefore,  even  though  the 
approximating  distribution  is  completely  factorized,  its  parameters  are  coupled  according 
to  the  dependencies  in  the  true  distribution. 

Having  solved  for  the  variational  parameters,  we  can  use  them  to  do  inference  about 
the  hidden  state  sequence  using  Q  or  we  can  take  the  expectations  under  Q  of  the  sufficient 
statistics  of  P  to  be  used  in  an  EM  algorithm  for  learning  the  parameters  T  of  P. 


C.2.3  Structured  Approximation  for  Factorial  HMMs 

In  this  section,  we  describe  a  structured  approximation  for  factorial  HMMs.  Again,  both 
the  method  and  the  notation  are  based  on  Ghahramani  and  Jordan  [1997]. 

As  before,  we  start  by  choosing  a  family  of  distributions  with  which  to  approximated 
P ( { Si }  |  { Y] } ,  T  j .  This  family  is  depicted  in  Figure  C.3.  Unlike  the  mean  field  approxima¬ 
tion,  this  family  of  distributions  retains  the  temporal  structure  of  the  hidden  state  variables. 
It  essentially  treats  each  chain  as  an  independent  HMM,  decoupled  from  the  observed  data. 
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Figure  C.3:  Graphical  representation  of  the  approximating  family  of  distributions  for  the 
structured  variational  approximation  for  factorial  HMMs.  represents  the  state  of  the 
mth  chain  at  time  t.  In  this  approximation,  the  temporal  structure  of  each  chain  is  retained, 
but  the  chains  are  treated  independently  (the  arrows  from  the  state  variables  to  the  observed 
variables  are  broken).  This  figure  is  from  Ghahramani  and  Jordan  [1997]. 


Let  us  write  this  family  Q  as: 


M 


z0 

-  m=  1  t= 2 

We  choose  the  following  parameterization  for  the  components  of  Q: 

K 


Q(S'[m,\6) 

iogQ(S<mV) 


logQfsf'lsS.O) 


?(m) 

,k 


k=  1 

E  s!?  iog(fc!?4m)) 

k= 1 

((Sjm)),logMm)  +  (^m))/log7r^ 


j(m) 


n(fti?n<4 

k= 1  V  j= 1 

(s™)' log  t£m)  +  (s£}y  log 


(C.38) 


(C.39) 


(C.40) 
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The  variational  parameters  are  6  =  A^m\  }itm}  }Vm,  t,  where  Iifm>  is  a  K  x  1  vector 

that  intuitively  plays  the  role  of  the  probability  of  a  fictitious  observation  for  each  of  the 
K  settings  of  S[m\  So  we  have: 


Q({$}|0) 

logQ({Pt}|0) 


K 


M  K  T  K 

j-  niKCvrvs’nnidi^: 

®  m= 1  fc=l 


q(m) 

Jt,k 


t= 2  fe=l 


M 


3= 1 
M  T 


m= 1  i=l 


-  log(Zg)  +  X  (Si”’)'  log  ’T(m)  +  £  £(S,‘m|)'  log  A!m,  + 

ra=l 

M  T 

££(«  )'log^M^m) 


m=  1  £=2 


(C.41) 


We  follow  the  same  reasoning  as  above  in  determining  how  to  set  the  parameters  9  of 
Q\  that  is,  we  again  seek  to  minimize  the  KL  divergence  between  0  and  P : 


KL(Q\\P)  —  £  Q({St}l$)  log 
(M 


g(W|g) 


(C.42) 


KL(Q\\P)  =  Q{{St}\8)  logQ({$}|0)  -  5]  g({5t}|0)  logP({St}|{£t},  4/) 

{-S'*}  {St} 

(C.43) 
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KL(Q\\P) 
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(C.44) 
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KL(Q\\P) 
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771=1  t= 2 

T  T  M 

-Ekc-7-EE 

£=1  t=l  771=1 

1  M  T 

^EE  tr{diag{EQ[S^m)]}(W{m)yC-1W{m)}  + 

771=1  t=l 

M  T 

^EEE  tr  {Eq  [St(m)  (St(n) )']  ( w(n)  yc-1  w(m> } 

m= 1  n^m  t= 1 

(C.45) 


M  T 

KL(Q\\P)  =  -  log(Zg)  +  E  E  fiQ[S«(’”l]'log/4’")  +  'og(Zp)  + 

m=  l  t=l 

1  T  T  M 

-E^/c-'y.-EE 

t=l  £=1  771=1 

1  M  T 

^EE  tr  {diag{EQ  [St(m)] }  ( W(m)  )/C'“1  }  + 

771=1  t=  1 

1  M  T 

^EEE  tr{EQ[sim\sin)y](w{n)yc-1w{m)} 

m= 1  n^m  t= 1 

(C.46) 
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Note  that  we  can  write  this  as: 


KL(Q\\P) 


M  T 

-  log(Zo)  +  log(Zp)  +  EQ[Slm  )]'log  + 

m=  1  t= 1 


M  T 

EE  f(EQ{s‘m']) 


m= 1  t= 1 


(C.47) 


where 


/(E0[s,(m)])  =  ^iY;c-'Yt-Y;c-1w^EQ[s{r>}  + 

^tr{diag{EQlSim)}}(Wim>yC-1lVi’n)}  + 
i  V  «r{B<j[S,(’")(5t(n))'](iy(”>)'C-1iy("*)} 

n^m 

(C.48) 

which  makes  the  derivative: 
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where 
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Substituting  these  back  in  gives: 
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d  log  h 


(C.53) 


We  will  solve  this  equation  by  setting  the  terms  inside  the  first  term  of  the  product  (with 

dEn  rS(m)l 

— t(l)  being  the  second  term)  equal  to  zero.  This  will  clearly  find  a  solution.  Since  the 

d  log  hr  '  '■ 

KL  divergence  is  convex  (though  we  will  not  prove  this),  any  solution  to  this  equation  is  a 
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global  optimum.  This  lets  us  avoid  computing 
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d  log 
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log  /4m) 
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M 

-  ^{w^yc^w^EQis^} 
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-  j2(w{m)yc~lw{n)EQ[stn) ]] 

n^m 

exp[(W"(m))/C'-1y/m)  -  ^A{(W"(m))/C'-1W"(m)} 

(C.54) 


where  again  >y  '"  =Yt  —  the  residual  error  in  yt  given  the  predictions 

from  all  the  state  variables  except  for  m. 

At  the  end  of  this  derivation,  we  end  up  with  a  set  of  fixed  point  equations  for  ft}'”"1 
that  depend  on  EqIsI™1}.  The  estimation  of  the  variational  parameters  therefore  proceeds 
iteratively,  using  the  forward-backward  algorithm  for  HMMs  to  calculate  the  expected 
values  for  each  m  using  the  current  h  values,  and  then  updating  the  h  values,  iterating 
until  convergence. 
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Appendix  D 


Derivation  of  Sampling  Distributions  for 
HPM-DBNs 


In  Section  4.4,  we  presented  a  Gibbs  sampling  algorithm  for  doing  inference  and  learning 
in  HPM-DBNs.  On  each  iteration  of  the  Gibbs  algorithm,  we  sample  a  new  value  for  each 
parameter  in  the  set  4/  from  its  full  conditional  distribution.  In  this  appendix,  we  provide 
the  derivations  of  the  full  conditional  distributions  each  components  of  4'. 

Recall  that  4/  and  each  of  its  components  are  proportional  to  the  likelihood  times  the 
prior: 


p^\{i,s,y})  oc  p({/,s,y}|4>)p(4>) 


(D.l) 


where  the  likelihood  is: 


t  n 


p({/,s,r}it)  =  nn^^yi/y.^)]  x 

t=  1  7T=1 

n  niPisi'v;:’,.  {si-p  h  >< 

t= 2  7T— 1 

T 

np(vi{s,},n',s) 


t= i 


(D.2) 
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the  log-likelihood  is: 


log  P({I,S,Y}\V) 


t  n 
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(D.3) 


and  the  prior  over  the  set  'I-'  is: 
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oEQ(7r) 

(D.4) 
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So  for  some  normalizing  constant  c  we  have: 


log  P(*\{I,S,Y}) 


t  n 


t=  1  7T— 1 

t  =  2  7T  — 1 
T 

J]iogP(y,|{s,},w,E)  + 

t=  1 

n 

logP(S)  +  5>gP(^})  +  log  P(6^7r))  +  log  P(W^)  + 


7T—  1 


logP(4'>)  +  Y, 

06^(77) 


(D.5) 


Below,  we  choose  priors  for  each  component  of  T  and  show  how  each  conditional 
distribution  is  derived  by  grouping  terms  not  involving  the  parameter  of  interest  into  the 
normalizing  constant  c. 


D.l  Sampling  b 


Let  us  focus  first  on  the  parameters  bU)  =  {b^\  //f  1 }  Vtt.  Since  b{^  and  b(f'1  are  the  param¬ 
eters  of  multinomial  distributions,  we  will  use  the  conjugate  priors,  Dirichlet  distributions 
with  parameters  and  a\v> .  For  b\j\  we  have: 


°o 

~  Dir(a^) 

D(n) 

p(bi'>) 

1  TT  (b{n))a°r 

R/„W\ 

£>\ao  )  j= 1 

log  P(C) 

=  -log(S(af)))  +  (aJ 

lyiog&r 


(D.6) 


where  ag7r'>  is  a  x  1  vector  and  P(-)  is  the  Beta  function.  The  prior  is  analogous  for 
ot^\  Letting  { T  \  b\^) }  denote  all  parameters  in  T  except  b\^)  and  grouping  all  terms  not 
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involving  into  the  normalizing  constant  c,  we  have: 

logP^K*  \  #>},  {/,  Y})  =  c  +  log  P(S[n)\l[n),  b(;\  bP)  +  log P(b^) 

=  c+(i-ii*\s[n)yiogbi?)+ii”\s^y\ogb(?)  + 
-log +  (a(0w)  -  lyiogb^ 

=  c+(l-/1W)(^)yiogr  +  (a^}  -  l)\ogb{Q] 

=  c+((l~I^)S^  +  a^-iy\ogb^ 

(D.7) 


A  similar  derivation  for  6,7r'i  yields: 


b™\{*  \  by>},  {/,  5,  y}  ~  Pir(/r^r + «n 


'  M 


O)  o(7r) 


(D.8) 


D.2  Sampling  A 


Next  consider  the  transition  matrix  parameters.  For  each  n  we  have  a  known  matrix 
and  unknown  matrices  A71  Vo  G  0(7r).  Each  of  these  matrices  contains  only  one  free 
parameter,  which  is  element  (0, 1),  the  probability  of  chain  n  transitioning  from  0  to  1 
under  the  conditions  represented  by  the  particular  A  V .  We  will  call  these  free  parameters 
pV-  The  probability  of  transitioning  from  0  to  0  in  these  matrices  is  then  1  —  pV,  since 
no  other  transitions  are  legal.  Then  the  values  of  p0  are  the  parameters  of  Binomial 
distributions  over  W  i  and  their  conjugate  priors  are  Beta  distributions.  For  pV  we 
have: 


p{: > 


Bet 


P(pP)  = 


(vr) 


B($\  7^5 

logP(pW)  =  -logB(^\^)  +  (f3^-l)logp^  +  (^ 


l)log(l-pW) 

(D.9) 


Here,  VV  and  TcV  are  the  parameters  of  the  distribution,  and  B ( • )  is  the  Beta  function 

(A) 

again,  not  to  be  confused  with  the  Beta  distribution.  For  a  particular  pu  ,  we  collect  terms 
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of  the  full  conditional  to  get: 


log /Vi*1 1 {»  \  £4,  {I,  s,  Y})  =  c  +  j>gP(S<'>|S<:>,  {/£,(„)}.  A<’>)]  + 

t.= 2 

logi’tp'r1) 


<=+£[(  i-<(s£3>o)E^I  (sS)'iog4')s,('>+ 


(tt) 


y  ■ 


I  O)  oO) 


t=  2 


i(sf>  >  0)  £  (4:i(s£).)'iog4'W’))]  - 

O 

log  b(/3<t\  7<r))  +  (/4'>  - 1)  log  pw  + 
b{:}  - 1)  log(l  -  £4 

=  o  +  >  0)  (/^S^Og^'W'1)]  + 

t=2 

(t£}  - 1)  logpw  +  b{;}  - 1)  iog(i  -  p?>) 

=  c+f>(s«3  >  o)4:i  (s!%  logos'’1)  + 


t=  2 


Hsiy  >  0)4:1  (st(!>li0  iog(i  -  pi'1)^?)]  + 
(P{f  - 1)  logpi’1  +  (o',’1  - 1)  mi  -  p{:’) 


=  c  + 


I>(s£2  >  o)4:Lst(:l„sS 


M  qW  qO 


0  -  1 


X 


,  t=2 

logpL'1  + 


pL')|{S'\pL')},U,5,n 


>  o)4-isi:\fis^] +7w  -ix 

gt=2  / 

log(l  -  p<f>) 

>  0)4:14-1, osui  +  p{j\ 


t= 2 


2>(s£3  >  o)4isj:i0st0 


(D.10) 


t=2 
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D.3  Sampling  W 


We  will  treat  the  W  matrices  one  process  and  one  time  point  at  a  time.  That  is,  we  will 
work  with  WQ  ,  the  length  V  vector  that  is  contributed  to  the  output  variable  by  process 
7 r  at  the  dth  time  point  after  it  begins.  (Note:  technically  is  1  x  V.  To  avoid  tedious 
notation  in  the  derivation  below,  when  we  write  W ^  we  will  assume  that  it  has  already 
been  transposed  into  aFxl  column  vector.)  We  define  a  multivariate  Gaussian  prior  over 
W^]  with  parameters  (aFxl  vector)  and  (aFxF  matrix): 


P(W{/]) 

logPlVj'1) 


,(7r)wv>(,rh-i/ 


rd 0 


,  (7r)> 


=  (2t)172|s..»|,/2  w •)) 

=  -  log((2Ir),'/2|S‘'>|1/2)  -  \(W?>  -  Pr'iwP  - 

=  c  -  l(w2i'))'(E?,)-1w2F  +  (Wfl)'(£<'l)-vi')  - 

(Dll) 


The  full  conditional  distribution  is: 


iogP(iyj')|{<P\iyj’)},{/,5,y}) 


c + E>g  wks<}. w-  s)i  +  '°g 

(=i 

c+E[«)'e-1E(w,(’,)'s.w- 


t=  1 


J2(w^ys^}  - 

7T  7 T 

l(w^)'(T,f)-'w^  +  (wT’ypf'rv?’ 

(D.12) 
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iogF(iy«|{vi>\»'j')},{/,s,y}) 


c+^Kiyr^ys-'yj 


t=  1 


£[(wr^)'£-,wf,s<j]  - 


t=l 

T 


EkE^^'E^'W'1 


t= 1 


§(»fl)'(E'lr1»fl  +  (M-r')'(£irrvr 


^(7r)\//-<r^(7r)\  — 1  ..C71-) 


c-  j W7  (  +  (Si  )_1  )  + 


.  i=l 


(»f r  x 


ZlstJ s-‘«  -  i  ^(W'<',,)'S'r))]  +  (E’EV?1 


v£=l 


c  -  \<yvP)'  D^J)2^"1  +  (^i70)-1 )  ^  + 

\t=l 

(»f 1 7  (ek sff)2]^-1  +  (EE1)  x 

\t=i 

EKs'J)^-1  +  (EE 


,  t=l 


s-1  EE'M  -  s  E  (^<’''))'s't(’r'>)]  +  tEEv?’ 


t=i 

T 


7r'E7r 


ElEK'iE'EEE1 


x 


.  t=i 
T 


s-1  e$?  «  -  o  E  (»r,’'))'sr))] + (EE;e 


t=i 


7T,E7r 


-1 


EpS’e1  + 12?1'-1 


(D.13) 


x=l 
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D.4  Sampling  E 


Recall  that  E  is  a  diagonal  covariance  matrix;  that  is,  it  has  the  values  of  a2  on  its  diagonal 
and  it  is  zero  off-diagonal.  This  means  we  can  write  P(Yt\{St},  'T)  as: 


v 


mi{s,}.>p)  =  np«.»i{s,},t) 


(D.14) 


V=1 


where 

and 


vt,v  =  E(wT))'s(? 


7T 


(D.15) 


Below,  we  derive  the  full  conditional  distribution  for  each  a2,  independently.  We  define 
the  prior  over  each  a2  to  be  an  inverse-gamma  with  parameters  rjv  and  vv: 

a2  ~  InvGamma{rjvl  vv) 


n°D  = 


K)7'*  ( i 


r(^) 


Vv  +  l 


—  exp(— — ) 


at 


log  P{a2v)  =  Vv  log  Vv  -  log  T(r/,,)  -  (r)v  +  1)  log  a2v - | 


at 


(D.16) 


The  full  conditional  distribution  is  then: 


logP(<T„2|{vl>  \  a.2},  {I,S,Y})  =  c  +  j>gP(y,,„|{S,},  W, £)]  +  logP(<72) 


t=  1 


=  c  +  Tlog(av)  - —^2[(Yt. 


v  H't.v 


t= 1 


(Vv  +  1)  log  av - 2 


at 


c  ~  (^T  +  Vv  +  1)  log(cr^)  —  {vv  +  ~  —  V>t,v)2})  — 


t= l 


1  1 

a2 |{'T  \  a2},  {/,  S ,  Y}  ~  InvGamma(-T  +  Vv,  uv  + 


t= 1 


(D.17) 
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e  to 


Appendix  E 

Variational  Inference  for  HPM-DBNs 


Here  we  adapt  the  structured  variational  approximation  for  fHMMs  (see  the  tutorial  in 
Section  C.2.3  for  more  details)  to  the  model  described  in  Chapter  4  (with  process  ordering 
constraints).  Since  the  exact  structure  of  the  model  varies  with  different  modeling  choices 
(e.g.  the  length  of  influence  of  an  input  variable  on  a  process  chain)  we  will  work  from  an 
example,  shown  in  Figure  E.l,  which  we  will  approximate  as  shown  in  Figure  E.2. 
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1(1) 


Figure  E.l:  Example  HPM-DBN  with  process  ordering  constraints  for  which  the  varia¬ 
tional  inference  algorithm  is  developed.  This  is  the  graphical  structure  of  the  probability 
distribution  P  in  the  text. 
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1(1) 


s(1) 


I  (2) 


S(2) 


Figure  E.2:  The  structured  approximation  Q  to  the  true  HPM-DBN  (P)  shown  in  Figure 
E.l. 
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We  begin  by  writing  down  the  probability  distribution  for  the  true  model  (Figure  E.l): 


T  M 

P({I,S,Y})  =  fllW)!* 

t= 1  m=  1 

M  T 

n  p(sim)\iim))  n  Pistwi  {s,(”f")}) 

m=l  t= 2 

OpWIM1)] 

t= 1 

(E.l) 


The  components  of  this  distribution  are  as  follows  (they  are  explained  in  more  detail 
in  Chapter  4): 


log  Pdf’) 


/M  logpM  +  (1  -  4m))  log(l  - 


(E.2) 


iogp(^m)|/jm)) 


n 


7T, 


W\S|: 

0,fc  / 


(m) 


r(m) 


n 


7T 


(m) 


1  ,fc 


(1  -  /M)((S-M)/iog7rM  +  /M(^)'log7rSm) 


(E.3) 
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P(stwiui-l{rn)h{s^m)})  =  [U(<l)snAf 


1-«(S&>0)  £„/£*> 


X 


iogP(s<’">|s<!"1),{/a(m)},{sS'")}) 


h3 


M  )S^lAy 

o,i,jJ 


n  fnw 

O  \\  l,j 

(sS)'  log  A^sim>  + 


s(s£\>o)i^0' 


X 


5(si:\  >  0)5:  [lastly  logA^s^ 

o 

(E.4) 


The  form  of  P(S(fm)\Sl^l ,  { I(t'%[rn) } •  { Sr"^'"} } )  varies  depending  on  specific  model¬ 
ing  choices.  Generally,  S[m>  can  depend  on  some  number  of  input  values  where 

Q(m)  is  a  vector  of  the  legal  offset  values  for  process  m.  The  length  and  values  of  these 
vectors  vary  from  one  process  to  another,  and  therefore  from  one  chain  to  another.  S(tm'1 
can  also  depend  on  zero  or  more  other  processes  (n  ^  m).  These  dependencies  encode 
process  ordering  constraints.  In  this  example,  chain  2  at  time  t  depends  on  chain  1  at 
time  t  —  1  but  in  general  the  dependence  could  be  on  any  time  point  of  any  other  chain. 
{g(n^m)}  can  a]SQ  ke  l|lc  empty  set,  as  js  the  case  for  chain  1,  which  does  not  depend 
on  any  other  chains.  Here,  A ^  contains  the  transition  dynamics  for  the  case  where  the 
relevant  input  values  are  0  and/or  a  chain  on  which  m  depends  has  not  yet  begun.  /l[m  en¬ 
codes  the  dynamics  for  the  case  where  the  input  value  at  t  —  o  was  1  and  any  dependencies 
on  other  chains  are  satisfied. 
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P(Yt\St) 


\ogP(Yt\St) 


\C\H2n)  t  x 


M 


M 


exp  |  W{m)Sim))'C-\Yt  -  W^S(t 


m)\ 


m=  1 


m=  1 


1  „  .  D  , 


log(|C'|  2  (2tt)  2 )  - 

„  M 


M 


-{Yt-J2  W^S^yC-^Yt  - 


m)x 


777=1 


771=1 


(E.5) 


Putting  all  this  together  gives: 


log  P({I,S,Y}) 


M  T 


M 


E2>prt->)  +  E  logP(S[’”l|/1<m|)  + 

m=  1  t=  1  m=  1 

M  T  T 

E  E  los  m<m)|s,<-i),  {S(n)})  +  E  lo«  -P(Vf|Ss(1),  s«<2>) 

771=1  t= 2  t=  1 

M  T 

EE  4”  logp(m)  +  (1  -  It(m))  log(l  -  p(m) )  + 

771=1  t=l 

M 

^(1  -  /M)((SM)/iog7rM  +  jM(s-("0)'log7rM  + 


m=  1 
M  T 


EE[ifR-l>  o)E42 

777=1  £=2  \  O  , 

(s<!”1))'iog4"')s;ro)  + 

i(sS  >  o)  ■£  (ii-listi)’ + 


EMICI’P 


D  1 

7T)  2  )  —  — 


M 


M 


t=l 


(Yt  -  J2  W^S^yC-^Yt  -  W^m)sim)) 

(E.6) 


777=1 


777=1 
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The  approximating  distribution  (Figure  E.2)  is: 


Q({i,s}) 


T  M 


X 


o  n 

t=  1  m=l 

M  T 

;W|  rM'i 


II  Q(s!m)l4m>)  n  0('S.<’”l|si("\),  {/,<m) 


})] 


m= 1 


t=2 


(E.7) 


The  components  of  Q  are  similar  to  those  of  P  except  for  the  addition  of  variational 
parameters  h  to  replace  the  dependence  on  outputs  Y  and  the  process  ordering  constraints: 


Q(sim)|/im)) 

logQ(S<”‘)|/.<”)) 


n<" 


(m)  {m)\S\ 
1  ,k  H0,k 


(m) 


(1 


Am) 


rh 


(m)  hn)sS{ 
1  ,k  nl,k  ) 


(m) 


(Sim)yiogh[m)  +  (1  -  /{m))(S'im))/ log4m)  +  l|m)(5ira))'log^m) 

(E.8) 


Q(sr\sn,m(m)}) 


log 


n  1 


<•»)  \sSMf 


m  (iKcnw 


0,1, J  > 


(sSr 


g(™) 


+ 


o 

(E.9) 


Note  that  in  the  structured  approximation,  there  is  no  dependence  on  other  chains  in 
Q(,5fm)|,S')'"l),  { } )  •  The  full  distribution  is  then: 
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log  Q({I,S}) 


T  M 

££u*P(J<">)  + 

t= 1  m= 1 

M  M  T 

X>gQ(s<",>i4",))  +  E  El°s<?(s*<‘)is-<->)-{/<’”)) 

m=  1  m=  1  £=2 


T  M 


EE  /«”')logp<’">  +  (l-/«”l)log(l-pC"')  + 

i=l  m=  1 
M 

£(S<m))'log/i<m)  +  (1  -  /M)((SM),iog7rM  +  /M^My  log7r( 

ra=l 

M  T  /  \ 

E  E<s<(m|)'  log  4m)  +  i  -  E  Ei  (s,1-.’)'  log  4m,'S.(m)  + 


ra=l  t=2 


E  it-o(s!m!y  log  4m)s,<m) 


(E.10) 


As  discussed  in  the  overview  of  variational  inference  (Section  C.2.1),  we  wish  to 
choose  paramters  of  Q  to  minimize  the  KL  divergence  between  P  and  ().  The  KL  di¬ 
vergence  is: 
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KL(Q\\P) 


-  log  ZQ  +  log  Zp  +  £  WL  5})[logQ({/,  S})  -  log  P({I,  S,  y})] 

{5} 
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T  M 
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M  T 
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M 
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(5t<m1))'log4m)S<",>  + 
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(E.ll) 
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After  simplifying,  this  is: 


KL(Q\\P) 


-  log  Zq  +  log  Zp  +  J2  Q{{I,  -S'})  [ 
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m= 1  t= 1 
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m=  1 


m=  1 


(E.12) 


We  wish  to  set  the  derivative  of  the  KL  divergence  with  respect  to  the  variational 
parameters  equal  to  zero  and  solve  for  their  updates. 
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dKL(Q\\P) 
d  log  h? 
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+ 
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EErt!  =  - 
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<E  ^ixKiog  4"%  is.'”1  (sS )']))] 


(E.13) 


We  can  solve  this  equation  for  the  parameters  h  using  an  iterative  method. 
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Appendix  F 

Visualizations  of  Learned  HPM 
Processes 


This  appendix  contains  images  of  the  processes  of  HPM-3-K  learned  using  baseline,  reg¬ 
ularized,  and  basis  function  versions  of  HPMs  on  Participant  D.  Each  section  corresponds 
to  one  of  the  methods  for  parameterizing  and  training  HPMs,  and  the  three  figures  in  each 
section  show  the  ViewPicture,  ReadSentence,  and  Decide  process  response  signatures  av¬ 
eraged  over  time.  Darker  red  voxels,  therefore,  have  higher  amplitude  on  average.  Details 
on  HPM-3-K  are  found  in  Section  5.2.1. 


F.l  Standard  HPMs 
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subject  04847,  process  comprehend  pict  known  amplitude=[-30.7  39.5] 


a 

a 

w 

4  % 

\0 

W 

w 

0 

Figure  F.l:  Learned  ViewPicture  process  in  HPM-3-K  learned  from  all  trials  and  all  voxels 
using  standard  HPMs. 
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Figure  F.2:  Learned  ReadSentence  process  in  HPM-3-K  learned  from  all  trials  and  all 
voxels  using  standard  HPMs. 
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subject  04847,  process  make  decision  amplitude=[-26.8  34.5] 
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Figure  F.3:  Learned  Decide  process  in  HPM-3-K  learned  from  all  trials  and  all  voxels 
using  standard  HPMs. 
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F.2  Regularized  HPMs 
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subject  04847,  process  comprehend  pict  known  amplitude=[-31 .4  40.4] 
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Figure  F.4:  Learned  ViewPicture  process  in  HPM-3-K  learned  from  all  trials  and  all  voxels 
of  Particpant  D  using  regularized  (for  spatial  and  temporal  smoothness)  HPMs. 
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Figure  F.5:  Learned  ReadSentence  process  in  HPM-3-K  learned  from  all  trials  and  all 
voxels  of  Particpant  D  using  regularized  (for  spatial  and  temporal  smoothness)  HPMs. 
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subject  04847,  process  make  decision  amplitude=[-25.8  33.2] 
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Figure  F.6:  Learned  Decide  process  in  HPM-3-K  learned  from  all  trials  and  all  voxels  of 
Particpant  D  using  regularized  (for  spatial  and  temporal  smoothness)  HPMs. 


156 


F.3  HPMs  with  Basis  Functions 
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subject  04847,  process  comprehend  pict  known  amplitude=[-28.8  37.0] 


a 

w 

w 

4  % 

/s 

w 

Vrf 

/\ 

W 

u 

Figure  F.7:  Learned  ViewPicture  process  in  HPM-3-K  learned  from  all  trials  and  all  voxels 
of  Particpant  D  using  HPMs  with  basis  functions. 
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subject  04847,  process  comprehend  sent  known  amplitude=[-15.8  20.2] 


Figure  F.8:  Learned  ReadSentence  process  in  HPM-3-K  learned  from  all  trials  and  all 
voxels  of  Particpant  D  using  HPMs  with  basis  functions. 
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subject  04847,  process  make  decision  amplitude=[-25.2  32.3] 
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Figure  F.9:  Learned  Decide  process  in  HPM-3-K  learned  from  all  trials  and  all  voxels  of 
Particpant  D  using  HPMs  with  basis  functions. 
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